In our use case, we've mostly had to deal with handwritten text, and that's where none of them really did well. Your next best bet would be to use HOG (Histogram of Oriented Gradients) features along with SVMs. OpenCV has really good implementations of both.
Even then, we've had to write extra heuristics to disambiguate between 2 and z, and between s and 5, etc. That was too much work and a lot of if-else. We're currently putting our efforts into CNNs (Convolutional Neural Networks). As a start, you can look at Torch or Caffe.
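For reference, the HOG + SVM baseline is only a few lines with OpenCV's Python bindings. A minimal sketch; train_glyphs/train_labels/test_glyphs are placeholders for your own labeled 28x28 character images:

    import cv2
    import numpy as np

    # HOG descriptor sized for 28x28 glyph images
    hog = cv2.HOGDescriptor(_winSize=(28, 28), _blockSize=(14, 14),
                            _blockStride=(7, 7), _cellSize=(7, 7), _nbins=9)

    def features(glyphs):
        # One flattened HOG vector per glyph (glyphs: 28x28 uint8 images)
        return np.array([hog.compute(g).ravel() for g in glyphs], np.float32)

    svm = cv2.ml.SVM_create()
    svm.setType(cv2.ml.SVM_C_SVC)
    svm.setKernel(cv2.ml.SVM_LINEAR)
    svm.train(features(train_glyphs), cv2.ml.ROW_SAMPLE,
              np.array(train_labels, np.int32))
    predictions = svm.predict(features(test_glyphs))[1].ravel()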
But the existing frameworks are mostly bad as engineering products. Caffe, for example, calls `exit` every time an error occurs, including recoverable ones like a file that does not exist.
And for comparison, an OCR application with Tesseract inside, which has a dramatically lower text recognition rate: http://blog.a9t9.com/p/free-ocr-windows.html
(Disclaimer: both links are my little open-source side projects)
Overall very good. I'm just wondering whether the library works better with image files than with PDFs?
The OCR library itself supports only image formats as input, so it's "innocent" with regard to this issue ;)
Much, much better than what I can get with Tesseract. Would love to have it as an API service.
When I compared Tesseract to Abbyy, the difference was night and day. Abbyy, straight out of the box, got me 80-90% accuracy on my text. Tesseract got around 75% at best, even with several layers of image pre-processing.
I know you said open source, and I just wanted to say that I went down that path too and discovered that, in my case, proprietary software really was worth the price.
And the kicker: you can't buy those licenses from them directly. They put you in contact with some random local distributor with a monopoly, which usually has mandatory "training" charges.
Messy, and I wouldn't touch it with a ten-foot pole.
All do-able within Linux.
I seriously doubt that it's just $100 -- but I guess if you're making an application that you're just going to run yourself on a single machine, it wouldn't cost much. The costs tended to increase with the number of computers you were deploying to. Or perhaps we used something more capable (does that version come with all languages, including Chinese/Japanese/Korean and RTL languages?).
What kind of problem were you dealing with? Were you bumping up against any constraints?
I think we opted, in the end, to have the customer obtain their Abbyy license from Abbyy directly because of the licensing mismatch. We sold just the wrapper.
Our parent company was a big user of Abbyy and, I think, had a totally custom deal. They needed it for all of the language support and similarly wanted the full power to run it at high speed and inside .NET programs.
They also sold Recognition Server, which has fewer options to integrate with programmatically (it was only 'hot folders' at the time) and, I think, only runs on Windows, but costs less.
And for their mobile OCR, which is lightweight and designed to run on smartphones, they worked out deals where they get a percentage of revenue.
I have a hobby project where I scrape Instagram photos, and I actually only want to end up with photos with actual people in them. There are a lot of images being posted with motivational text etc. that I want to filter out automatically.
So far I've built a binary that scans images and spits out the dominant-color percentage; if one color is over a certain threshold (black background with white text, for example), I can be pretty sure it's not something I want to keep.
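Roughly, the check is this (sketched in Python; the bucket size and threshold are arbitrary knobs):

    import numpy as np
    from PIL import Image

    def dominant_color_fraction(path, bucket=32):
        pixels = np.asarray(Image.open(path).convert("RGB")).reshape(-1, 3)
        # Quantize channels so near-identical shades count as one color
        _, counts = np.unique(pixels // bucket, axis=0, return_counts=True)
        return counts.max() / counts.sum()

    # e.g. white text on a black background scores very high here
    if dominant_color_fraction("post.jpg") > 0.6:
        print("probably a text image, skip it")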
I've also tried OpenCV with face detection, but I had a lot of false positives, with faces being "recognized" in text and random-looking objects. I've tried four of the haarcascades, all with different, but not perfect, end results.
OCR was my next step to check out; maybe I can combine all the steps to get something nice. I was getting weird text back from images with no text, so the pre-processing hints in this thread are gold and I can't wait to check them out.
This thread is giving me so many ideas and actual names of algorithms to check out; I love it. But I would really appreciate it if anyone else has more thoughts about how to filter out images that do not contain people :-)
1. Run an LBP cascade on the picture. This is lower quality, with higher false positives, but it uses integer math, so it's fast. It's named lbpcascade_frontalface.xml.
2. Capture the regions of interest that the LBP cascade identifies as a face and throw them into a vector<Mat>. This means you can capture a (potentially) arbitrary number of faces. Of course, with OpenCV you are limited to a minimum face size of 28x28 px.
3. Run the Haar cascade for eye detection on the ROIs you saved in the vector. Ones that return eyes are a good match. Haar cascades are slower (because they use floats), but the reduced pixel count means it's relatively fast. It's named haarcascade_eye_tree_eyeglasses.xml.
I can maintain 20fps with this setup at 800x600 on a slow computer.
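In Python-bindings form, the flow looks roughly like this (my version uses C++ and vector<Mat>, but the calls are the same; the detector parameters here are made up):

    import cv2

    lbp = cv2.CascadeClassifier("lbpcascade_frontalface.xml")
    eyes = cv2.CascadeClassifier("haarcascade_eye_tree_eyeglasses.xml")

    def confirmed_faces(gray):
        hits = []
        # Pass 1: fast integer-math LBP detector, tolerant of false positives
        for (x, y, w, h) in lbp.detectMultiScale(gray, 1.1, 3, minSize=(28, 28)):
            # Pass 2: slower Haar eye detector, only on the candidate ROI
            if len(eyes.detectMultiScale(gray[y:y + h, x:x + w])) > 0:
                hits.append((x, y, w, h))
        return hits

    frame = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
    print(confirmed_faces(frame))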
I also like your dominant-color idea. If you have a couple of approaches that each make sense, you can use them as an ensemble: the more classifiers that don't like an image, the more likely it's junk.
It's a lot of manual work, but using OpenCV saves you a lot of time. I can't share the code, unfortunately, but what I did was this:
* Get all Instagram photos with the '#selfie' tag
* Run it all through the haarcascade_frontalface_alt2 OpenCV cascade; I used 1.3 and 5 as the scale-factor and minimum-neighbors values for the detectMultiScale() method.
* Check that there's only one face in the image, and make sure it's larger than 20% of the width of the image.
Even after that I still needed to go through the images manually. I'd guess around 10% were still false positives.
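In code form, the filter boils down to something like this (a sketch, not the original):

    import cv2

    cascade = cv2.CascadeClassifier("haarcascade_frontalface_alt2.xml")

    def keep_photo(path):
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        faces = cascade.detectMultiScale(gray, 1.3, 5)
        # Exactly one face, wider than 20% of the image width
        return len(faces) == 1 and faces[0][2] > 0.2 * gray.shape[1]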
Google allows image searches to be filtered by is/isn't a face. I think you could tap into that knowledge, although it isn't immediately clear what the route would be.
Here's a detailed rundown of the more obscure Google search parameters: https://stenevang.wordpress.com/2013/02/22/google-search-url...
The relevant one for your purposes is "tbs=itp:face".
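For example, combined with the image-search vertical, a face-filtered search looks something like this (the query term is made up):

    https://www.google.com/search?q=portrait&tbm=isch&tbs=itp:face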
It's a brilliant piece of software for a number of things.
If you want to do text extraction, look at things like Stroke Width Transform to extract regions of text before passing them to Tesseract.
There are samples here  and here  to get you started. The paper is here: 
In summary, you need templates that map the field positions to meaningful keys so that you can get back useful data as JSON/CSV/XML. I have some tools, still being polished, that automate much of the template creation and do a lot of the pre-processing for you.
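To illustrate the template idea (a sketch with made-up field names and boxes, using the pytesseract bindings; not the actual tooling):

    import json
    import pytesseract
    from PIL import Image

    # Map meaningful keys to (left, top, right, bottom) pixel boxes on the form
    TEMPLATE = {
        "invoice_number": (520, 40, 760, 80),
        "total": (520, 700, 760, 740),
    }

    def extract(path):
        page = Image.open(path)
        return {key: pytesseract.image_to_string(page.crop(box)).strip()
                for key, box in TEMPLATE.items()}

    print(json.dumps(extract("form.png")))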
email is my username (at) gmail
No, a whole lot of pre-processing would be needed. It all depends on the exact layout: if your tolerances are tight, you need much more logic than if you have, say, 2cm of white space around the one sentence you're after.
We open-sourced the library that we use for exactly that purpose: https://github.com/creatale/node-fv
It's pretty sad considering that OCR is basically a solved problem. We have neural nets that are capable of extracting entities from images and big-data systems that can play Jeopardy, but no one has used that tech for OCR and put it out there.
It's kind of the same reason there is no good open-source solid modelling software (like SolidWorks or NX). The problem has been "solved" for years, but actually doing it is too mammoth a task for open-source software.
When they get a good GUI hooked up to the calls, it will be at parity with the paid solutions.
A lot of this comes down to how much time you're willing to put in to get it up and running, and whether you're willing to put in extra effort to train or refine a model.
(P.S. All the elements I'm about to mention here have come up elsewhere in this thread; I'm just providing a bit more context and bringing what I know together.)
Gotta have something now: Tesseract
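Getting a first result takes just a couple of lines; here's the minimal version via the pytesseract bindings (any of the bindings work the same way; page.png is a placeholder):

    import pytesseract
    from PIL import Image

    # One call: hand Tesseract a page image, get plain text back
    print(pytesseract.image_to_string(Image.open("page.png"), lang="eng"))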
Not saying it's "bad," but it ain't as great as you want. Bindings for every language, lots of configs, documented well enough that it's 80% of the suggestions here. For what it is, it's a damn miracle. Most languages, alphabets, and lots of kinds of fonts will likely start giving you results somewhere between 80% and the Uncanny Valley immediately. You also get a pretty good Layout Analysis engine if you're working with complete pages of text. The models it ships with are robust, but if you want better output, retraining it is a real pain (you have to segment the training data on a character-by-character basis). You're better off trying to apply a general preprocessing pass to clean the image, or...
Gotta have a proof of concept this afternoon: Send just text to Tesseract
By now you've realized that asking an OCR engine to OCR a tree (or a picture of a tree alongside some text) somehow always comes back looking like a cat just napped on your keyboard. As alluded to in a few other places here: Just Give It Text! Tesseract (and most traditional OCR) was developed in a text-document-centric world where you could safely assume the input was just a page of words, and sometimes you might have to deal with columns (that's where the Layout Analysis comes in). That's probably not your real world.
Depending on your type of documents, you might be able to develop a few heuristics for identifying text regions in the picture and then send only those sections over to Tesseract. This will dramatically help Tesseract out, though it does add complexity you'll have to juggle. Heuristics specific to your documents can be very good, especially if they let you infer more information about those regions that you might want later; one common general-purpose one is sketched below.
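The classic cheap trick (an example, not a prescription): binarize, dilate horizontally so letters smear into blocks, and treat big contours as text regions.

    import cv2

    gray = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE)
    _, bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
    # Dilate with a wide kernel so neighboring letters merge into word blocks
    blocks = cv2.dilate(bw, cv2.getStructuringElement(cv2.MORPH_RECT, (15, 3)))
    # OpenCV 4.x return signature (3.x also returns the input image)
    contours, _ = cv2.findContours(blocks, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if w * h > 500:  # skip specks
            region = gray[y:y + h, x:x + w]  # this crop goes to Tesseract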
You can also use something like the Stroke Width Transform (all the links I would use were graciously posted in an earlier comment), which came out of Microsoft Research's work on spotting text in the wild in street-level imagery. ccv has a very nice SWT implementation, and an HTTP server for the whole library that, with a bit of makefile finagling, can get you a very nice SWT-preprocessor -> Tesseract API over HTTP in an hour or so.
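The glue ends up being tiny; something like this, assuming ccv's server is running locally (the endpoint name, port, and response shape here are from memory -- verify them against the ccv-serve docs linked below):

    import requests
    import pytesseract
    from PIL import Image

    SOURCE = "photo.jpg"
    # Assumed endpoint/port; check the ccv HTTP docs for the exact API
    resp = requests.post("http://localhost:3350/swt/detect.words",
                         files={"source": open(SOURCE, "rb")})
    img = Image.open(SOURCE)
    for box in resp.json():  # assumed shape: {"x", "y", "width", "height"}
        crop = img.crop((box["x"], box["y"],
                         box["x"] + box["width"], box["y"] + box["height"]))
        print(pytesseract.image_to_string(crop))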
Also, it looks like the OpenOCR project is now using SWT -> Tesseract with a nice HTTP API written in Go, conveniently packaged up for Docker and very well documented.
Take me down the rabbit hole, but maybe get results in 2 days:
The roots of Tesseract were planted by HP in 1985, so there had to be a better way at some point. OCRopus was supposed to be the great state-of-the-art hope that would save us all. The approach was incredible and the results being published were great, but the documentation came in the form of a mind map. Recently the project was picked up again by the original developer, rechristened OCRopy, and it has garnered a pretty active and growing community in the past few months.
You'll have to do a bit more work here than with plain Tesseract, but the LSTM neural-network approach completely blows away Tesseract's results with just a little training.
To get started, Dan Vanderkam's tutorial is excellent for working with the out-of-the-box model immediately.
But results get INCREDIBLE when you take some time to train your own model. I provided Dan with the source images for the project, from The New York Public Library's historical collections, and the source text was in a font style the default model had never encountered before. But after an hour or two of transcribing training data (training data in OCRopus is awesome because you just feed in a line at a time; no need to align and segment each letter like in Tesseract), he was getting an error rate under 1%!
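The whole out-of-the-box pipeline from the tutorial is three commands; driven from Python it looks roughly like this (command names are from the ocropy repo, but flags and output filenames follow the tutorial's conventions and may vary by version):

    import subprocess

    # 1. Binarize/normalize the scan into book/0001.bin.png
    subprocess.run(["ocropus-nlbin", "scan.png", "-o", "book"], check=True)
    # 2. Segment the page into individual line images
    subprocess.run(["ocropus-gpageseg", "book/0001.bin.png"], check=True)
    # 3. Recognize each line with the separately downloaded default model
    subprocess.run(["ocropus-rpred", "-m", "en-default.pyrnn.gz",
                    "book/0001/010001.bin.png"], check=True)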
Layout analysis isn't OCRopy's strongest suit, so you might get even better results if you pre-segment with something like SWT, but again, not completely necessary.
That's it. Way too much for a Newsy Combinator comment, but a pretty decent tour of the big stuff in the open OCR world. There's some magic state-of-the-art going on inside Google, Microsoft, and Abbyy, but hopefully you can now teach your computer to appreciate whatever it is you want it to read.
: http://libccv.org/doc/doc-swt/ and http://libccv.org/lib/ccv-swt/
: http://libccv.org/doc/doc-http/ and http://libccv.org/lib/ccv-serve/
: https://github.com/tmbdev/ocropy | all day all week
: http://www.danvk.org/2015/01/09/extracting-text-from-an-imag... | though you've got to download the model separately because it's too big for github
It's designed to work primarily with old fire-insurance atlases (e.g. Sanborn maps) and is a bunch of hand-tuned heuristics. But it powers Building Inspector (http://buildinginspector.nypl.org), where all the data is validated by consensus crowdsourcing, providing ground truth for the data coming out.
Unfortunately (or perhaps fortunately, if you're trying to get a Computer Science PhD), I haven't come across anyone applying deep learning to map vectorization either. Frankly, it could open up a whole new field with respect to historical mapping (among other things).
I'd love to talk more about it and see how deep learning could be applied here. We always structured the outputs of Building Inspector so they'd be useful as training sets for unsupervised or reinforcement learning, so hopefully they apply here as well.
If you're doing real time street sign recognition or involved with a book scanning and archival startup, investigate Tesseract for sure. But even then you'll probably want to prototype with gocr first.
Since you mentioned ocrad.js, I assume you're searching for something in JS/Node.js. Many others have already recommended Tesseract and OpenCV.
We built a library around Tesseract for character recognition and OpenCV (among other libraries) for preprocessing, all for node.js/io.js: https://github.com/creatale/node-dv
If you have to recognize forms or other structured images, we also created a higher-level library: https://github.com/creatale/node-fv
There are options to adjust the image in various ways and once Tesseract runs, it's easy to get the result in various formats.
So if the source image contains text columns or pull quotes or similar, the output text will just be each row of text, from the far left to the far right.
I'd make a cascade that detects all letters and numbers from major font sets. That shouldn't be too terribly difficult.
Then use the cascade to scan the document and convert it to a list of all detected characters (we don't actually care what the characters are).
Once you have this, fit bounding boxes around the detections. You'll have to figure out at what distance a character should be excluded from a given box.
What you should end up with is a few boxes indicating the regions of data on the document. Crop each of these regions of interest and feed them into Tesseract.
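A sketch of that grouping step (naive single-pass merging; detected_boxes stands in for whatever your cascade returns):

    import pytesseract
    from PIL import Image

    def merge_boxes(boxes, gap=20):
        # boxes: (left, top, right, bottom); grow regions that overlap within `gap`
        regions = []
        for b in sorted(boxes):
            for i, r in enumerate(regions):
                if (b[0] <= r[2] + gap and b[2] >= r[0] - gap and
                        b[1] <= r[3] + gap and b[3] >= r[1] - gap):
                    regions[i] = (min(r[0], b[0]), min(r[1], b[1]),
                                  max(r[2], b[2]), max(r[3], b[3]))
                    break
            else:
                regions.append(b)
        return regions

    page = Image.open("document.png")
    for region in merge_boxes(detected_boxes):
        print(pytesseract.image_to_string(page.crop(region)))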
Our particular application was OCRing brick-and-mortar store receipts directly from emulated printer feeds (imagine printing straight to PDF). We found that Tesseract had too many built-in accommodations for image scanning (warped characters, lighting and shadow defects, photographic artifacts); when applied to what was effectively clean, 1-to-1 character rendering, it failed miserably.
We found that building our own software to recognize the characters on a 1 to 1 basis produced much better results. See: http://stackoverflow.com/questions/9413216/simple-digit-reco...
With the caveat that none of the stuff was handwritten.
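The gist of the 1-to-1 approach, sketched (keep one clean template image per character and pick the best normalized match; no OCR engine involved; the SO link above covers a similar kNN variant):

    import cv2

    def match_char(glyph, templates):
        # templates: dict of character -> same-size grayscale template image
        scores = {ch: cv2.matchTemplate(glyph, t, cv2.TM_CCOEFF_NORMED).max()
                  for ch, t in templates.items()}
        return max(scores, key=scores.get)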
Unfortunately the native Linux version is a bit pricey:
Otherwise I would use the command line version to help me index all my data.
Examples here: http://funkybee.narod.ru/
Apologies, I am not sure if it's open source.
Caffe has been used for handwriting; it seems like OCR of typefaces would work just the same with a typeface dataset.
Here is the only other OCR example with Caffe I could find:
This dude did a pretty nice writeup on getting it going: http://gaut.am/making-an-ocr-android-app-using-tesseract/