Hacker News
Using Tesseract OCR with Python (pyimagesearch.com)
195 points by jonbaer on July 11, 2017 | 47 comments

We are trying to automate the entire loan application and processing workflow. This involves a lot of character recognition, because our target group keeps its financial documents as hard copies. Helping them autofill their information makes the task easier and avoids typing errors. After reading a few articles, I first built an OCR pipeline using Google’s OCR library, Tesseract. The classifier produced good results on standardised documents, but as document complexity grew, such as when reading a cheque, it became hard to achieve reasonable accuracy. To avoid the complexity of training a custom classifier and deploying it on the cloud (which would require a significant amount of computation), we decided to use Microsoft Azure’s Vision API. It gave us the coordinates of all the text, so all we had to do was look for text resembling an account number and IFSC from a cheque book. With some regex it was easy to find closely matching strings. Later we extended this to bank statements, and this is where even Azure failed to read everything in the image. We had tried Google’s Vision API earlier, but the output wasn’t satisfactory. So we decided to work on making the image more readable. I came across a lot of image filters whose main purpose is to convert the image to pure black and white. I tried many of them, including mean, median and Gaussian thresholding. The one that worked best for us was a custom-designed filter based on Otsu’s thresholding principle.
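The regex step described above might look roughly like the following. This is a minimal sketch, not the poster's actual code: it assumes the OCR service hands back plain text lines, the IFSC pattern follows the published format (four letters, a zero, six alphanumeric characters), and the 9 to 18 digit account-number range and the function name `find_cheque_fields` are assumptions made for illustration.

```python
import re

# IFSC: 4 letters (bank), a literal 0, then 6 alphanumeric characters (branch).
IFSC_RE = re.compile(r"\b[A-Z]{4}0[A-Z0-9]{6}\b")
# Assumption: account numbers are standalone runs of 9 to 18 digits.
ACCOUNT_RE = re.compile(r"\b\d{9,18}\b")


def find_cheque_fields(ocr_lines):
    """Return the first IFSC-like and account-number-like strings found."""
    ifsc = None
    account = None
    for line in ocr_lines:
        text = line.upper()
        if ifsc is None:
            match = IFSC_RE.search(text)
            if match:
                ifsc = match.group()
        if account is None:
            match = ACCOUNT_RE.search(text)
            if match:
                account = match.group()
    return {"ifsc": ifsc, "account": account}
```

Real OCR output tends to confuse characters such as O/0 and I/1, so a production version would probably need fuzzier matching than these strict patterns.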

Adrian here, author of the PyImageSearch blog. I'll add doing a tutorial on cheque recognition (at least the routing and account numbers) to my queue. Thanks for the great suggestion.

One possible alternative solution is to chop the image into smaller images (with something like ImageMagick) based on each value's likely location in the document, then OCR those. You get a confidence interval with tesseract, so you can iterate over possible templates (or shrink/expand crops) until you get an [edit: aggregate] interval you're comfortable with.
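A rough sketch of that crop-and-retry loop, under stated assumptions: `ocr` is a hypothetical callable that takes a `(left, top, right, bottom)` box and returns `(text, confidence)` with confidence on a 0 to 100 scale, and the margins and the threshold of 80 are arbitrary illustrative values.

```python
def grow_box(box, margin):
    """Expand a (left, top, right, bottom) box by `margin` pixels per side."""
    left, top, right, bottom = box
    return (max(0, left - margin), max(0, top - margin),
            right + margin, bottom + margin)


def read_field(ocr, box, min_conf=80, margins=(0, 5, 10, 20)):
    """Try progressively larger crops until the OCR confidence is acceptable.

    `ocr` is a hypothetical callable: ocr(box) -> (text, confidence 0-100).
    Returns the best (text, confidence, box) seen across all attempts.
    """
    best = ("", -1, box)
    for margin in margins:
        candidate = grow_box(box, margin)
        text, conf = ocr(candidate)
        if conf > best[1]:
            best = (text, conf, candidate)
        if conf >= min_conf:
            break  # good enough, stop expanding
    return best
```

The same loop could iterate over candidate templates instead of margins; only the list of boxes to try would change.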

Thanks for the suggestion. Will try and share the results here

Except for the size of cheque and the position of magnetic characters, none of the text on cheques is standardised in India. Hence we might stand a chance of chopping characters

Is this for US banks? I'm assuming not.

>It provided us the coordinates of all the texts and all we had to do was look for texts similar to an Account number and IFSC from a cheque book. Using some regex it was easy to find closely matching strings

Could you explain what you mean by this? We are trying to read shopping receipts, but I have zero background in image processing, so I have been trying to figure out what to do. I have been trying the Google Vision API, though.

>The one which worked best for us was a custom designed filter using Otsu’s Thresholding principle.

Is this where you pre-process the image to make it readable? How does one do it: are these specialized tools, or can I do this in Python (like http://www.scipy-lectures.org/packages/scikit-image/auto_exa...)?
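For reference, plain Otsu thresholding can be done in a few lines of Python. The sketch below is the textbook method (pick the gray level that maximizes between-class variance), not the custom filter mentioned in the parent comment; it assumes the image is already a flat list of 8-bit gray values, and the function names are illustrative.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0-255) that maximizes between-class variance.

    `pixels` is a flat iterable of integer gray levels in 0..255.
    Pixels <= t form one class, pixels > t the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with level <= t
    sum0 = 0  # sum of gray levels of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # lower class still empty
        w1 = total - w0
        if w1 == 0:
            break  # upper class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to 0 (black) or 255 (white) around the threshold."""
    return [255 if p > threshold else 0 for p in pixels]
```

In practice you would likely use a library routine instead, for example `cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)` in OpenCV or `skimage.filters.threshold_otsu` in scikit-image.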

For receipts, I recommend: http://ocrsdk.com/ which is the online product of Abbyy. They also have a blog post giving you some ideas: http://blog.ocrsdk.com/top-5-pains-for-developers-in-receipt... Has anyone successfully implemented it themselves for receipts or invoices? What was your strategy?

I can probably help, send me an email.

If you are thinking of using OCR for this, I would suggest SikuliX. It is a Tesseract-based automation tool written in Java, with Jython bindings. I have used it before and loved it.

Is it possible to share the images that did not work for you?

We are mainly facing challenges with cheques and bank statements that have noisy backgrounds, e.g. those of HDFC Bank.

Out of interest, are you facing problems with the sort code etc. on cheques too? If so, I was wondering: don't they use magnetic ink for those?

No, magnetic ink is only used at the bottom of the cheque for banking systems to identify the source of cheque.

If you plan on using tesseract definitely try out their 4.0 beta, which uses LSTMs.


From the Wiki,

> The Tesseract 4.00 neural network subsystem is integrated into Tesseract as a line recognizer.

The LSTM is used in layout analysis, not in character recognition.

Is it faster with LSTMs? Or just fewer errors?

Note that pytesseract has no bindings with tesseract's API, it merely uses the tesseract command line tool and communicates with it via stdout/stdin.
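To make the "wraps the command line tool" point concrete, here is a minimal sketch of calling the `tesseract` binary from Python. It is not pytesseract's actual code and its internals may differ; it assumes `tesseract` is installed and on the PATH, and the function names are made up for illustration.

```python
import subprocess


def build_tesseract_command(image_path, lang="eng"):
    """Build the argv list for the tesseract CLI, writing text to stdout."""
    # Passing "stdout" as the output base makes tesseract print the
    # recognized text instead of creating an output file.
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_via_cli(image_path, lang="eng"):
    """Run the tesseract binary and return the recognized text."""
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return result.stdout.decode("utf-8")
```

Every call pays the cost of starting a new process, which is one reason to prefer real API bindings when OCRing many small crops.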

If you want native and complete access to tesseract's API you can use tesserocr: https://github.com/sirfz/tesserocr

Tesseract is OK for printed material that's neatly organized, but other than that it seems the only other programmatic OCR is Google Cloud Vision. It's a hundred times better, but unfortunately I need to OCR documents I contractually can't show the mighty G.

In the "better than Tesseract" category there is also Microsoft Azure OCR (not as good as Google) and the OCR.space OCR API (also not as good as Google, but 100 times cheaper/free, and it supports PDF).

The best - and most expensive - solution is still Abbyy OCR. They provide an SDK that can be used locally.

A new local OCR solution is Anyline.io, but I have not used them yet.

Sorry to hijack this but I have a question about your comment here: https://news.ycombinator.com/item?id=14441748

How did you get Copyfish to play nice with Zhongwen/Perapera? I've tried it with Chrome and Firefox and nothing seems to get them to pick up on the OCR text.

I'm trying to read things like street signs, speed limits, store names, from not-necessarily-axis-aligned pictures - so far it seems only Google OCR can do those (and does them quite well). Is Abbyy worth trying for that use?

No API, but mapillary is doing that with machine learning:


It seems likely that Google is doing something similar.

I can probably help you with that, send me an email.

I thought I remembered seeing that you could read documents with IBM's Watson APIs? Anybody tried that?

I'm combining opencv and pytesseract in order to process some scanned forms. Doing this I was able to link 70k forms to a database previously filled by professional typists. Now I have a huge data set I can use to train ML algorithms, I'm experimenting with several of them.

I have no formal training in CV, so my impression is that recognition is relatively easy; the hard thing is the preprocessing needed to normalize the images.

There are a number of steps you'll need to figure out. First, for the 70k forms, where do the fields come from? Then, for every scan, you need to find the bounding box of every field in a somewhat automated manner. You can use histograms and blob detection to help with a number of these.

Once you have thresholded text boxes that are quite legible, you can train your CNNs and LSTMs to read text from images.
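One common reading of the "histograms" suggestion above is a projection profile: count the ink pixels per row to find text bands, then per column inside each band. The sketch below assumes the scan is already binarized into a 2D list of 0/1 values (1 meaning ink); the function names and the `min_ink` cutoff are illustrative assumptions.

```python
def row_profile(binary):
    """Count ink pixels (value 1) in each row of a 2D list of 0/1 values."""
    return [sum(row) for row in binary]


def column_profile(binary):
    """Count ink pixels (value 1) in each column."""
    return [sum(col) for col in zip(*binary)]


def find_bands(profile, min_ink=1):
    """Return (start, end) pairs, end exclusive, of consecutive entries
    whose ink count is at least `min_ink`."""
    bands = []
    start = None
    for i, value in enumerate(profile):
        if value >= min_ink and start is None:
            start = i
        elif value < min_ink and start is not None:
            bands.append((start, i))
            start = None
    if start is not None:
        bands.append((start, len(profile)))
    return bands
```

Running `find_bands(row_profile(page))` gives candidate line bands, and applying `column_profile` to the rows of one band splits it into candidate field or word boxes. Skewed or noisy scans need deskewing and cleanup first, otherwise the bands blur together.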

A few days ago I wrote some Python code to process a PDF file with Tesseract or the Google Vision API: https://github.com/lucab85/PDFtoTXT

Is there a way to combine the character-level OCR with knowledge of the English dictionary? Something like `pregrarrmung` should be able to map to 'programming' especially with n-gram context of pregrarrmung experience.

Yep, it's called adding a language model.

Check out this paper (2011) for a good summary of the pros and cons: https://research.google.com/pubs/pub36984.html
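As a very small stand-in for a language model, the standard library's `difflib` can snap OCR output to the closest dictionary word. This is only a dictionary lookup by string similarity, with no n-gram context; the tiny vocabulary and the function names are assumptions for illustration.

```python
import difflib


def correct_word(word, vocabulary, cutoff=0.6):
    """Return the closest vocabulary word, or the input if nothing is close."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text, vocabulary):
    """Correct each whitespace-separated token independently."""
    return " ".join(correct_word(token, vocabulary) for token in text.split())


vocabulary = ["programming", "experience", "processing", "engineering"]
print(correct_text("pregrarrmung experience", vocabulary))
```

A real language model would also score candidates by how likely they are next to their neighbours, which is what lets it choose between two dictionary words that are equally close to the garbled token.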

This is great, thanks! I wonder how it would do with a more state-of-the-art NN model rather than relying on word frequency as a model.

If you really want to test an OCR tool, try scanning something like an Xbox Live membership card. The fonts on those cards seem to be specifically designed to mess with OCR.

I used tesseract/pytesseract with almost perfect preprocessing (blur, Otsu, etc.). But to get good results you need big images, 300 dpi or more, and the big images make it too slow. Maybe I should have tried segmenting the characters before running the OCR. I ended up writing my own OCR from scratch, using averages etc.; it is almost instant, and I am happy with it.

I am currently trying out tesseract/pytesseract on shop receipts, but I have not been able to get meaningful results. I have tried adaptive Gaussian and mean thresholding, and I have also tried blur, but no joy yet. You mentioned building from scratch: how? And what is your minimum size?

Here are two of my blog posts about using OCR to bypass some security mechanisms:

In short: it's Python code where you press one button and it takes a screenshot, crops the image, decodes it, and types it in at 900+ rpm.


To see how it is in action without the OCR functions:


I created a set of Python bindings to Tesseract a couple of years ago. While not complete, they would likely make a great starting point for anyone wanting to interface with it at a deeper level. Reminds me, I should do some modernization work on it. https://github.com/blindsightcorp/tesserpy

(Hopefully) Related question: What is the state of the art in OCR on photographs? Is there something like the inception model for OCR?

Depends. If the text is in a static location, it can be done easily.

A tangentially related question: Will OpenCV (used under the hood in this example) continue to support Python bindings in future versions?

I don't see why not. The bindings are automatically generated for the most part.

However, this does not mean that all functionality will be available from Python, especially when code generation is not enough.

The image stitching library for example hits an assertion failure when called from Python. Disabling the check appears to work, but then you get warnings about incorrect reference counts.

Does anyone know if there are tools to detect Photoshopped/forged/modified images?

For preprocessing you can just use Pillow for thresholding, RGB <-> gray conversions, etc. While OpenCV offers many more options, it's a heavy library to use for this kind of functionality.

Any link?

I am surprised at the lack of any mention of SikuliX.

Sikuli uses tesseract under the hood with opencv template matching. How is that going to be helpful?

I know Tesseract is OS but when I tried it, it was nearly useless.

I use Abbyy with WINE. But a native Linux shell version of Abbyy is available: http://www.ocr4linux.com/en:start

I've found Tesseract to be really good for, e.g., bank/cc statements, as long as I scan at 600dpi.
