
Using Tesseract OCR with Python - jonbaer
http://www.pyimagesearch.com/2017/07/10/using-tesseract-ocr-python/
======
kumartanmay
We are trying to automate the entire loan application and processing workflow. This involves a lot of character recognition, since our target group has their financial documents as hard copies. Helping them autofill their information makes their task easier and avoids the human errors that come with typing. So, after reading a few articles, I first designed an OCR pipeline using Google's OCR library, Tesseract. It produced good results when reading standardised documents, but as the complexity of the document grew - reading a cheque, for example - it became challenging to achieve reasonable accuracy. To avoid the complexity of training a custom classifier and deploying it on the cloud (which would require a significant amount of computation), we decided to use Microsoft Azure's Vision API. It gave us the coordinates of all the text, and all we had to do was look for strings shaped like an account number or an IFSC code on a cheque; using some regex, it was easy to find closely matching strings. Later we extended this to read bank statements, and this is where even Azure failed to read everything in the image. (We had tried Google Vision's API earlier, but the output wasn't satisfactory.) So we decided to work on making the image more readable. I came across a lot of image filters whose main purpose is to reduce an image to pure black and white, with no other colours. I tried many of them, including mean, median and Gaussian thresholding. The one that worked best for us was a custom-designed filter based on Otsu's thresholding principle.
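
The core idea can be sketched in pure Python (a toy illustration, not the commenter's actual filter): Otsu's method picks the black/white threshold that maximizes the between-class variance of the histogram, and a regex can then pull IFSC-shaped strings (4 letters, a zero, 6 alphanumerics) out of the OCR text. The pixel values and the sample IFSC string below are invented for demonstration.

```python
import re

def otsu_threshold(pixels):
    """Return the cut that maximizes between-class variance (Otsu's method)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    total_sum = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_b = sum_b = 0  # pixel count and intensity sum of the "dark" class
    for t in range(256):
        w_b += hist[t]
        if w_b == 0:
            continue
        w_f = total - w_b
        if w_f == 0:
            break
        sum_b += t * hist[t]
        mean_b = sum_b / w_b
        mean_f = (total_sum - sum_b) / w_f
        var_between = w_b * w_f * (mean_b - mean_f) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Two-cluster toy "image": dark ink (~40) on light paper (~220).
pixels = [40] * 50 + [220] * 50
t = otsu_threshold(pixels)
binary = [255 if p > t else 0 for p in pixels]  # pure black and white

# An IFSC code is 4 letters, then '0', then 6 alphanumerics.
IFSC_RE = re.compile(r"\b[A-Z]{4}0[A-Z0-9]{6}\b")
print(t, IFSC_RE.findall("Branch IFSC: HDFC0001234, A/C 123456789012"))
```

In practice OpenCV's built-in `cv2.THRESH_OTSU` flag does the same histogram search in C.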

~~~
zebra9978
is this for US banks ? I'm assuming not.

> _It provided us the coordinates of all the texts and all we had to do was
> look for texts similar to an Account number and IFSC from a cheque book.
> Using some regex it was easy to find closely matching strings_

Could you explain what you mean by this? We are trying to read shopping
receipts, but I have ZERO background in image processing... so I have been
trying to figure out what to do. I have been trying to use the Google Vision
API, though.

> _The one which worked best for us was a custom designed filter using Otsu’s
> Thresholding principle._

Is this where you pre-process the image to make it readable? How does one do
it - are these specialized tools, or can I do this in Python (like
[http://www.scipy-lectures.org/packages/scikit-image/auto_exa...](http://www.scipy-lectures.org/packages/scikit-image/auto_examples/plot_threshold.html))?

~~~
abc03
For receipts, I recommend: [http://ocrsdk.com/](http://ocrsdk.com/), which is
the online product of Abbyy. They also have a blog post giving you some ideas:
[http://blog.ocrsdk.com/top-5-pains-for-developers-in-receipt...](http://blog.ocrsdk.com/top-5-pains-for-developers-in-receipt-ocr/)

Has anyone successfully implemented it themselves for receipts or invoices?
What was your strategy?

------
m_ke
If you plan on using tesseract definitely try out their 4.0 beta, which uses
LSTMs.

[https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LST...](https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM)

~~~
beagle3
From the Wiki,

> The Tesseract 4.00 neural network subsystem is integrated into Tesseract as
> a line recognizer.

The LSTM is used in layout analysis, not in character recognition.

------
sirfz
Note that pytesseract has no bindings to Tesseract's API; it merely invokes
the tesseract command-line tool and communicates with it via stdin/stdout.

If you want native and complete access to tesseract's API you can use
tesserocr:
[https://github.com/sirfz/tesserocr](https://github.com/sirfz/tesserocr)
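
To illustrate the distinction, here is roughly what a subprocess-based wrapper does (a sketch of the approach, not pytesseract's actual source): it spawns one tesseract process per call and reads the text back over a pipe, whereas tesserocr calls the C++ API in-process. The file name `scan.png` is a hypothetical input.

```python
import os
import shutil
import subprocess

def tesseract_cmd(image_path):
    """Build the CLI invocation; 'stdout' tells tesseract to print the text."""
    return ["tesseract", image_path, "stdout"]

def ocr_via_cli(image_path):
    """One external process per call - the overhead a native binding avoids."""
    result = subprocess.run(tesseract_cmd(image_path),
                            capture_output=True, text=True, check=True)
    return result.stdout

# Only attempt the call if the binary and the input file actually exist.
if shutil.which("tesseract") and os.path.exists("scan.png"):
    print(ocr_via_cli("scan.png"))
```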

------
beagle3
Tesseract is OK for printed material that's neatly organized, but other than
that it seems the only other programmatic OCR is Google Cloud Vision. It's a
hundred times better, but unfortunately I need to OCR documents I can't
contractually show to the mighty G.

~~~
RandomBookmarks
In the "better than Tesseract" category is also Microsoft Azure OCR (not as
good as Google) and the OCR.space OCR API (also not as good as Google, but
100* times cheaper/free, and supports PDF).

The best - and most expensive - solution is still Abbyy OCR. They provide an
SDK than can be used locally.

A new local OCR solution is Anyline.io, but I have not used them yet.

~~~
beagle3
I'm trying to read things like street signs, speed limits, store names, from
not-necessarily-axis-aligned pictures - so far it seems only Google OCR can do
those (and does them quite well). Is Abbyy worth trying for that use?

~~~
maxerickson
No API, but mapillary is doing that with machine learning:

[http://blog.mapillary.com/product/2017/02/06/towards-global-...](http://blog.mapillary.com/product/2017/02/06/towards-global-traffic-sign-recognition.html)

It seems likely that Google is doing something similar.

------
scardine
I'm combining OpenCV and pytesseract to process some scanned forms. Doing
this, I was able to link 70k forms to a database previously filled in by
professional typists. Now I have a huge data set I can use to train ML
algorithms, and I'm experimenting with several of them.

I have no formal training in CV, so my impression is that recognition is
relatively easy; the hard part is the preprocessing needed to normalize the
images.
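
The linking step can be approximated with fuzzy string matching from the standard library (a sketch; the names and the 0.8 cutoff below are invented, not the commenter's actual data or pipeline): each noisy OCR string is snapped to its closest typist-entered record.

```python
import difflib

# Hypothetical data: noisy OCR output per form, and typist-entered records.
ocr_names = ["Jonh Smitth", "Mary Jonhson"]
db_records = ["John Smith", "Mary Johnson", "Pat Doe"]

# Link each OCR string to its closest database record, if any record is
# similar enough (cutoff is a similarity ratio in [0, 1]).
links = {}
for noisy in ocr_names:
    match = difflib.get_close_matches(noisy, db_records, n=1, cutoff=0.8)
    if match:
        links[noisy] = match[0]
print(links)
```

Once linked, the typed record serves as ground truth for the scanned form, which is exactly the kind of labeled pair a supervised model needs.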

~~~
nojvek
There are a number of steps you'll need to figure out. For the 70k forms:
where do the fields come from? Then, for every scan, you need to find the
bounding box for every field in a somewhat automated manner. You can use
histograms and blob detection to help with a number of these.

Once you have thresholded text boxes that are quite legible, you can train
your CNNs and LSTMs to read text from the images.
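
The histogram idea can be sketched with a horizontal projection profile (a toy illustration on an invented 6x6 "image", not a full blob detector): rows containing ink are grouped into vertical bands, each a candidate text line whose bounding box can then be refined.

```python
# Toy binarized scan: 1 = ink, 0 = background (a real one would come from
# thresholding an actual image).
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

row_ink = [sum(row) for row in image]  # horizontal projection profile

# Group consecutive inked rows into (top, bottom) bands.
bands, start = [], None
for y, ink in enumerate(row_ink + [0]):  # sentinel row closes a final band
    if ink and start is None:
        start = y
    elif not ink and start is not None:
        bands.append((start, y - 1))
        start = None
print(bands)  # each band is one candidate text line's vertical extent
```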

------
mrevolution
A few days ago I wrote some Python code to process a PDF file with Tesseract
or the Google Vision API:
[https://github.com/lucab85/PDFtoTXT](https://github.com/lucab85/PDFtoTXT)

------
mrweasel
If you really want to test an OCR tool, try scanning something like an Xbox
Live membership card. The fonts on those cards seem to be specifically
designed to mess with OCR.

------
jk2323
I know Tesseract is open source, but when I tried it, it was nearly useless.

I use Abbyy with WINE, but a native Linux shell version of Abbyy is available:
[http://www.ocr4linux.com/en:start](http://www.ocr4linux.com/en:start)

~~~
edoloughlin
I've found Tesseract to be really good for, e.g., bank/cc statements, as long
as I scan at 600dpi.

------
Unhackable
Here are two of my blog posts about using OCR to bypass some security
mechanisms.

In short: it's Python code where you press one button and it takes a
screenshot, crops the image, decodes it, and types it in at over 900+ rpm.

[https://anthonys.io/ocr-engine-playground/](https://anthonys.io/ocr-engine-playground/)

To see it in action without the OCR functions:

[https://anthonys.io/keybr-com-multiplayer-cheater/](https://anthonys.io/keybr-com-multiplayer-cheater/)

------
pgodzin
Is there a way to combine character-level OCR with knowledge of the English
dictionary? Something like `pregrarrmung` should be able to map to
'programming', especially with the n-gram context of 'pregrarrmung
experience'.
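
A crude standard-library approximation of the idea (a dictionary plus edit-distance-style similarity, not a real language model; the vocabulary list is invented): snap each OCR'd token to its closest dictionary word.

```python
import difflib

# Hypothetical vocabulary; a real system would use a full dictionary plus
# word-frequency or n-gram scores to break ties between candidates.
vocab = ["programming", "experience", "language", "progress"]

def correct(word):
    """Snap an OCR'd token to the closest dictionary word, if close enough."""
    match = difflib.get_close_matches(word, vocab, n=1, cutoff=0.6)
    return match[0] if match else word

print(correct("pregrarrmung"))  # the OCR garble from the comment above
```

A language model goes further by scoring whole candidate sequences, so context like "pregrarrmung experience" favors 'programming' even when several words are equally close in edit distance.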

~~~
ageitgey
Yep, it's called adding a language model.

Check out this paper (2011) for a good summary of the pros and cons:
[https://research.google.com/pubs/pub36984.html](https://research.google.com/pubs/pub36984.html)

~~~
pgodzin
This is great, thanks! I wonder how it would do with a more state-of-the-art
NN model rather than relying on word frequency as a model.

------
squidpickles
I created a set of Python bindings to Tesseract a couple of years ago. While
not complete, they would likely make a great starting point for anyone wanting
to interface with it at a deeper level. Reminds me, I should do some
modernization work on it.
[https://github.com/blindsightcorp/tesserpy](https://github.com/blindsightcorp/tesserpy)

------
rhlala
I used tesseract/pytesseract with almost perfect preprocessing (blur, Otsu,
etc.), but to get good results you need big images - 300 dpi+ - and the big
images make it too slow. Maybe I should have tried segmenting the characters
before running the OCR. I ended up writing my own OCR from scratch, using
averages etc.; it is almost instant, and I am happy with it.

~~~
ejanus
I am currently trying out tesseract/pytesseract on shop receipts, but I have
not been able to get meaningful results. I have tried adaptive Gaussian and
mean thresholding, and I have also tried blur, but no joy yet. You mentioned
building from scratch - how? And what is your minimum size?

------
niyazpk
(Hopefully) Related question: What is the state of the art in OCR on
photographs? Is there something like the inception model for OCR?

~~~
ersinesen
Google's Attention OCR in tensorflow:

[https://github.com/tensorflow/models/tree/master/attention_o...](https://github.com/tensorflow/models/tree/master/attention_ocr)

------
my_first_acct
A tangentially related question: Will OpenCV (used under the hood in this
example) continue to support Python bindings in future versions?

~~~
yorwba
I don't see why not. The bindings are automatically generated for the most
part.

However, this does not mean that all functionality will be available from
Python, especially when code generation is not enough.

The image stitching library for example hits an assertion failure when called
from Python. Disabling the check appears to work, but then you get warnings
about incorrect reference counts.

------
sandGorgon
Does anyone know if there are tools to detect Photoshopped/forged/modified
images?

------
newusertoday
For preprocessing you can just use Pillow for thresholding, RGB <-> gray
conversions, etc. While OpenCV gives you many more options, it's a heavy
library to pull in for this kind of functionality.
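
For instance (a minimal sketch using a tiny in-memory image rather than a real scan): `convert("L")` handles the RGB-to-gray conversion and `point` applies a fixed threshold. The pixel values below are invented for demonstration.

```python
from PIL import Image

# Tiny in-memory "scan": light background with one dark pixel.
img = Image.new("RGB", (4, 2), (200, 200, 200))
img.putpixel((0, 0), (20, 20, 20))

gray = img.convert("L")                           # RGB -> grayscale
bw = gray.point(lambda p: 255 if p > 128 else 0)  # fixed-threshold binarize

print(list(bw.getdata()))  # row-major pixels: only 0s and 255s remain
```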

~~~
ejanus
Any link?

------
holtalanm
I am surprised at the lack of any mention of SikuliX.

~~~
nojvek
SikuliX uses Tesseract under the hood, together with OpenCV template
matching. How is that going to be helpful?

