We are trying to automate the entire loan application and processing flow. This involves a lot of character recognition, since our target group keeps their financial documents as hard copies. Helping them autofill their information makes the task easier and avoids human errors while typing. After reading a few articles, I first built an OCR pipeline using Google's OCR library, Tesseract. It produced good results when reading standardised documents, but as document complexity grew, such as reading a cheque, it became challenging to achieve reasonable accuracy. So, to avoid the complexity of training a custom classifier and deploying it on the cloud (which would require a significant amount of computation), we decided to use Microsoft Azure's Vision API. It gave us the coordinates of all the text, and all we had to do was look for strings resembling an account number and an IFSC on a cheque. With some regex it was easy to find closely matching strings.
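To give a flavour of the regex step: the IFSC format (four letters, a zero, six alphanumerics) is published, but the account-number pattern and the word list below are just illustrative, not our production code.

    import re

    # Illustrative input: one string per text box from the Vision API.
    ocr_words = ["IFSC:", "HDFC0001234", "A/c", "No.", "50100123456789"]

    # IFSC: 4 letters, a zero, 6 alphanumerics (the published format).
    IFSC_RE = re.compile(r"^[A-Z]{4}0[A-Z0-9]{6}$")
    # Account numbers vary by bank; 9-18 digits is a rough heuristic.
    ACCOUNT_RE = re.compile(r"^\d{9,18}$")

    ifsc = [w for w in ocr_words if IFSC_RE.match(w)]
    accounts = [w for w in ocr_words if ACCOUNT_RE.match(w)]
    print(ifsc, accounts)  # ['HDFC0001234'] ['50100123456789']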
Later we extended this to read bank statements, and this is where even Azure failed to read everything in the image. We had tried Google Vision's API earlier, but the output wasn't satisfactory. So we decided to work on making the image more readable. I came across a lot of image filters whose main purpose is to binarize the image, reducing it to only black and white with no other colours. I tried a lot of them, including mean, median and Gaussian thresholding. The one that worked best for us was a custom-designed filter based on Otsu's thresholding principle.
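For anyone who wants to try these, the adaptive methods and Otsu's method are all one-liners in OpenCV. Our final filter was custom, but a stock sketch looks like this:

    import cv2

    # Load and grayscale the scan (path is illustrative).
    gray = cv2.imread("statement.png", cv2.IMREAD_GRAYSCALE)

    # Adaptive mean/Gaussian thresholding: the threshold varies per
    # neighbourhood (block size 31 and constant 10 are tuning knobs).
    mean_bw = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                    cv2.THRESH_BINARY, 31, 10)
    gauss_bw = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                     cv2.THRESH_BINARY, 31, 10)

    # Otsu's method: one global threshold chosen to minimise intra-class
    # variance between the black and white pixel populations.
    _, otsu_bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    cv2.imwrite("otsu.png", otsu_bw)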
Adrian here, author of the PyImageSearch blog. I'll add a tutorial on cheque recognition (at least the routing and account numbers) to my queue. Thanks for the great suggestion.
One possible alternative solution is to chop the image into smaller images (with something like ImageMagick) based on each value's likely location in the document, then OCR those. Tesseract gives you a per-word confidence score, so you can iterate over possible templates (or shrink/expand the crops) until you get an [edit: aggregate] confidence you're comfortable with.
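Something like this with pytesseract, say; the crop boxes are made-up placeholders for wherever your template expects each field:

    import pytesseract
    from PIL import Image

    page = Image.open("cheque.png")  # illustrative path
    # Hypothetical template: (left, top, right, bottom) box per field.
    template = {"account_no": (50, 300, 400, 340), "ifsc": (50, 350, 300, 390)}

    for field, box in template.items():
        crop = page.crop(box)
        data = pytesseract.image_to_data(crop, output_type=pytesseract.Output.DICT)
        # conf == -1 marks entries with no confidence estimate.
        words = [(w, int(c)) for w, c in zip(data["text"], data["conf"])
                 if w.strip() and int(c) >= 0]
        if words:
            avg = sum(c for _, c in words) / len(words)
            print(field, words, "avg conf:", avg)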
Except for the size of the cheque and the position of the magnetic characters, none of the text on cheques is standardised in India. Hence cropping by template would risk chopping characters in half.
>It gave us the coordinates of all the text, and all we had to do was look for strings resembling an account number and an IFSC on a cheque. With some regex it was easy to find closely matching strings.
Could you explain what you mean by this? We are trying to read shopping receipts, but I have ZERO background in image processing, so I have been trying to figure out what to do. I have been trying the Google Vision API, though.
If you are thinking of using OCR for this, I would suggest SikuliX. It is a Tesseract-based automation tool written in Java, but it has Jython bindings. I have used it before and loved it.
Tesseract is OK for printed material that's neatly organized, but beyond that it seems the only other programmatic OCR is Google Cloud Vision. It's a hundred times better, but unfortunately I need to OCR documents I can't contractually show to the mighty G.
In the "better than Tesseract" category is also Microsoft Azure OCR (not as good as Google) and the OCR.space OCR API (also not as good as Google, but 100* times cheaper/free, and supports PDF).
The best - and most expensive - solution is still Abbyy OCR. They provide an SDK that can be used locally.
A new local OCR solution is Anyline.io, but I have not used them yet.
How did you get Copyfish to play nice with Zhongwen/Perapera? I've tried it with Chrome and Firefox and nothing seems to get them to pick up on the OCR text.
I'm trying to read things like street signs, speed limits, store names, from not-necessarily-axis-aligned pictures - so far it seems only Google OCR can do those (and does them quite well). Is Abbyy worth trying for that use?
I'm combining opencv and pytesseract in order to process some scanned forms. Doing this I was able to link 70k forms to a database previously filled by professional typists. Now I have a huge data set I can use to train ML algorithms; I'm experimenting with several of them.
I have no formal training in CV, so my impression is that recognition is relatively easy; the hard part is the preprocessing needed to normalize the images.
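One concrete example of that normalization work is deskewing a tilted scan. Here is the common OpenCV recipe I've seen, as a sketch only; the minAreaRect angle convention has changed between OpenCV versions, so the correction below may need adjusting on your build:

    import cv2
    import numpy as np

    gray = cv2.imread("form.png", cv2.IMREAD_GRAYSCALE)  # illustrative path
    # Threshold inverted so text pixels become foreground (non-zero).
    _, bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Fit a rotated rectangle around all text pixels to estimate skew.
    coords = np.column_stack(np.where(bw > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    # Correction for the classic [-90, 0) angle convention; newer OpenCV
    # versions report angles differently, so verify this on your version.
    if angle < -45:
        angle = -(90 + angle)
    else:
        angle = -angle

    # Rotate around the image center to undo the skew.
    h, w = gray.shape
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    deskewed = cv2.warpAffine(gray, M, (w, h), flags=cv2.INTER_CUBIC,
                              borderMode=cv2.BORDER_REPLICATE)
    cv2.imwrite("deskewed.png", deskewed)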
There's a number of steps you'll need to figure out. For the 70k forms: where do the fields come from? Then, for every scan, finding the bounding box for every field in a somewhat automated manner. You can use histograms and blob detection to help with a number of these.
Once you have thresholded text boxes that are quite legible, you can train your CNNs and LSTMs to read text from the images.
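A rough sketch of that blob-detection step with OpenCV contours (the kernel shape and size filters are arbitrary placeholders you would tune per form):

    import cv2

    gray = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)  # illustrative path
    _, bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Dilate so nearby characters merge into word/field-level blobs.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 3))
    blobs = cv2.dilate(bw, kernel, iterations=1)

    # Each external contour is a candidate field region (OpenCV 4.x
    # returns two values here; 3.x returns three).
    contours, _ = cv2.findContours(blobs, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if w > 40 and 10 < h < 100:  # arbitrary size filter; tune per form
            boxes.append((x, y, w, h))
    print(len(boxes), "candidate field boxes")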
Is there a way to combine the character-level OCR with knowledge of the English dictionary? Something like `pregrarrmung` should be able to map to 'programming', especially with the n-gram context of 'pregrarrmung experience'.
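For the single-word case, even the standard library gets close; a minimal sketch with difflib and a made-up word list (the n-gram context part is the genuinely hard bit):

    import difflib

    # Made-up word list; in practice load a full English dictionary.
    dictionary = ["programming", "experience", "progressing", "pregame"]

    word = "pregrarrmung"
    matches = difflib.get_close_matches(word, dictionary, n=1, cutoff=0.6)
    print(matches)  # ['programming']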
If you really want to test an OCR tool, try scanning something like an Xbox Live membership card. The fonts on those cards seem to be specifically designed to mess with OCR.
I used tesseract/pytesseract with almost perfect preprocessing using blur, Otsu, etc.
But to get good results you need big images, 300+ dpi.
The big images make it too slow.
Maybe I should have tried segmenting the characters before running the OCR.
I ended up making my own OCR from scratch, using averages etc., and it is almost instant; I am happy with it.
I am currently trying out tesseract/pytesseract on shop receipts but I have not been able to get meaningful results. I have tried adaptive Gaussian and mean thresholding, and I have also tried blur, but no joy yet. You mentioned building from scratch; how? And what is your minimum size?
I created a set of Python bindings to Tesseract a couple of years ago. While not complete, they would likely make a great starting point for anyone wanting to interface with it at a deeper level. Reminds me, I should do some modernization work on it. https://github.com/blindsightcorp/tesserpy
I don't see why not. The bindings are automatically generated for the most part.
However, this does not mean that all functionality will be available from Python, especially when code generation is not enough.
The image stitching library for example hits an assertion failure when called from Python. Disabling the check appears to work, but then you get warnings about incorrect reference counts.
For preprocessing you can just use Pillow for thresholding, RGB <-> grayscale conversions, etc. While OpenCV gives you many more options, it's a heavy library to pull in for this kind of functionality.
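For example, a fixed-threshold binarization in Pillow is just a convert plus a point op (the 128 cutoff is an arbitrary choice):

    from PIL import Image

    img = Image.open("receipt.png")  # illustrative path
    gray = img.convert("L")          # RGB -> 8-bit grayscale

    # Fixed global threshold at 128 (arbitrary); "1" mode is pure black/white.
    bw = gray.point(lambda p: 255 if p > 128 else 0).convert("1")
    bw.save("receipt_bw.png")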