
A guide to OCR with Tesseract, OpenCV and Python - ole_gooner
https://nanonets.com/blog/ocr-with-tesseract/
======
sandreas
The preprocessing step uses Otsu's method, which is often inaccurate because
it applies a single global threshold to the whole image. An adaptive
thresholding algorithm (such as Sauvola or Wolf binarization) could improve
preprocessing A LOT on many images that are not purely black and white. See
[https://github.com/chriswolfvision/local_adaptive_binarizati...](https://github.com/chriswolfvision/local_adaptive_binarization)
for details.

Other nice resources:

- [https://www.researchgate.net/publication/306352164_Watershed...](https://www.researchgate.net/publication/306352164_Watershed_algorithm_based_segmentation_for_handwritten_text_identification)
- [https://isi.edu/integration/papers/chiang11-icdar.pdf](https://isi.edu/integration/papers/chiang11-icdar.pdf)
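To make the global-versus-local point concrete, here is a minimal NumPy-only sketch, not the article's code and not the linked repository's implementation. It assumes an 8-bit grayscale image, uses the usual Sauvola formula T = m * (1 + k * (s / R - 1)) with assumed parameters (window 15, k = 0.2, R = 128), and runs both methods on a made-up image whose background fades from bright to dark.

```python
import numpy as np


def otsu_threshold(gray):
    """Single global threshold that maximises between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    levels = np.arange(256)
    w0 = np.cumsum(hist)                      # pixels with value <= t
    w1 = hist.sum() - w0                      # pixels with value > t
    sum0 = np.cumsum(hist * levels)
    mu0 = sum0 / np.maximum(w0, 1)
    mu1 = (sum0[-1] - sum0) / np.maximum(w1, 1)
    return int(np.argmax(w0 * w1 * (mu0 - mu1) ** 2))


def sauvola_threshold(gray, window=15, k=0.2, r=128.0):
    """Per-pixel threshold from the local mean and std (window must be odd)."""
    pad = window // 2
    g = np.pad(gray.astype(np.float64), pad, mode="edge")

    def box_sum(a):
        # Integral image with a leading zero row and column.
        ii = np.pad(a, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
        return (ii[window:, window:] - ii[:-window, window:]
                - ii[window:, :-window] + ii[:-window, :-window])

    n = window * window
    mean = box_sum(g) / n
    var = np.maximum(box_sum(g * g) / n - mean ** 2, 0.0)
    return mean * (1.0 + k * (np.sqrt(var) / r - 1.0))


def compare_on_synthetic_gradient():
    """Count misclassified pixels for both methods on a synthetic image."""
    h, w = 40, 120
    background = np.tile(np.linspace(220, 80, w), (h, 1))  # uneven lighting
    text = np.zeros((h, w), dtype=bool)
    text[10:30, np.isin(np.arange(w) % 10, (4, 5))] = True  # dark strokes
    gray = np.where(text, background - 60, background).round().astype(np.uint8)

    otsu_text = gray <= otsu_threshold(gray)
    sauvola_text = gray < sauvola_threshold(gray)
    return int(np.sum(otsu_text != text)), int(np.sum(sauvola_text != text))


if __name__ == "__main__":
    otsu_errors, sauvola_errors = compare_on_synthetic_gradient()
    print("Otsu errors:", otsu_errors, "Sauvola errors:", sauvola_errors)
```

On this synthetic image the strokes on the bright side are lighter than the background on the dark side, so no single threshold can separate them, while a local threshold can. Real scans will behave differently, and the window size and k usually need tuning per document type.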

------
onemorelizard
The article is pretty great. Most tutorials I've found online working with OCR
basically run you through the installation process and a few basic CLI
commands or an introduction to their C++ API. This one takes you through some
interesting details like the bounding box info, template matching bits and
playing around with the config. The training process for Tesseract, though not
covered in the article, seems like quite a task.
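For anyone curious what the bounding box info looks like in practice, here is a small sketch. It assumes the dictionary shape that `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)` returns (parallel lists under keys such as `text`, `conf`, `left`, `top`, `width`, `height`). The `sample` dict below is hand-written for illustration, not real OCR output, and `confident_boxes` is a hypothetical helper name.

```python
def confident_boxes(data, min_conf=60.0):
    """Return (text, left, top, width, height) for words above min_conf."""
    boxes = []
    for i, text in enumerate(data["text"]):
        # conf may arrive as int, float or string depending on the version;
        # non-word rows are reported with a confidence of -1.
        conf = float(data["conf"][i])
        if text.strip() and conf >= min_conf:
            boxes.append((text, data["left"][i], data["top"][i],
                          data["width"][i], data["height"][i]))
    return boxes


# Hand-written stand-in for image_to_data output (an assumption, not real data).
sample = {
    "text": ["", "Invoice", "No.", "1234", " "],
    "conf": ["-1", 96, 91.5, 42, -1],
    "left": [0, 10, 90, 130, 0],
    "top": [0, 12, 12, 12, 0],
    "width": [200, 70, 30, 40, 0],
    "height": [50, 18, 18, 18, 0],
}

if __name__ == "__main__":
    for box in confident_boxes(sample):
        print(box)
```

Filtering on confidence before drawing or matching boxes is a design choice here; the threshold of 60 is arbitrary.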

------
m3nu
Automatically finding specific boxes/fields is quite interesting. I maintain a
Python package[1] that processes invoices using a template/regex-based
approach. It works alright, but eventually runs into some limitations. The
box-model from the article could push it further.

1:
[https://github.com/invoice-x/invoice2data](https://github.com/invoice-x/invoice2data)
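A rough sketch of what a template/regex-based approach can look like, using only the standard library. This is not invoice2data's actual implementation or template schema; the `TEMPLATE` dict, the field names, the patterns and the sample text are all made up for illustration.

```python
import re

# Hypothetical template: field name -> regex with one capturing group.
TEMPLATE = {
    "invoice_number": r"Invoice\s+No\.?\s*:?\s*(\w+)",
    "date": r"Date\s*:?\s*(\d{4}-\d{2}-\d{2})",
    "amount": r"Total\s*:?\s*\$?\s*(\d+(?:\.\d{2})?)",
    "vat_id": r"VAT\s+ID\s*:?\s*(\w+)",
}


def extract_fields(text, template):
    """Apply each pattern to the OCR text; missing fields come back as None."""
    result = {}
    for field, pattern in template.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        result[field] = match.group(1) if match else None
    return result


# Made-up OCR output for a single invoice page.
ocr_text = """ACME Corp
Invoice No: A1042
Date: 2019-12-20
Total: $ 118.50
"""

if __name__ == "__main__":
    print(extract_fields(ocr_text, TEMPLATE))
```

The `vat_id` field comes back as `None` because the text has no such line, which hints at the limitation mentioned above: a regex only finds what the template author anticipated, whereas box positions can help when labels vary.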

~~~
mpeg
Hey this is great, I made something ad-hoc to do this for a client and might
borrow some ideas to improve it.

I leaned heavily on AWS Textract for the bounding boxes though, as the kind of
data I had to extract didn't have very well-defined fields. I used some of the
techniques described in this link [0], particularly around table extraction.

I really like how you define the fields in YAML though, I defined mine in code
and it ended up being a bit messy.

[0]: [https://datascience.blog.wzb.eu/2017/02/16/data-mining-ocr-p...](https://datascience.blog.wzb.eu/2017/02/16/data-mining-ocr-pdfs-using-pdftabextract-to-liberate-tabular-data-from-scanned-documents/)

------
MayeulC
I've skimmed the article, which seems to give a fairly honest overview of the
OCR market, then Tesseract, the way it works, and how to interface with it
from Python.

However, the article is also an advertisement for nanonets, so they also chose
to highlight the complexity side a bit before putting themselves forward.

As someone who hadn't heard of them before, I think this could be stated in the
title. They seem to lease (I prefer that term) an API to do OCR with a couple of
rules and templates depending on your use case.

I am not entirely sure what they expect with this? Maybe SEO or to hijack
search results?

------
jftuga
OCRmyPDF (based on Tesseract) works very well:
[https://github.com/jbarlow83/OCRmyPDF](https://github.com/jbarlow83/OCRmyPDF)

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be
searched.

------
wswope
Potentially helpful notes:

The character whitelist/blacklist functionality doesn't work for the default
LSTM-based engine.

Regarding preprocessing, upscaling the image size can have a dramatic impact
on performance.

IIRC tessdata_fast (which the article mentions) is the default that ships with
most prebuilt versions of Tesseract, so you probably don't need to mess with
that. In my use case, I found that tessdata_best actually performed slightly
worse in terms of accuracy.
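A sketch of two ways around the whitelist limitation, under some assumptions. `--oem`, `--psm` and `-c tessedit_char_whitelist=...` are real Tesseract options, but whether the whitelist is honoured depends on the Tesseract version and engine, and the legacy engine (`--oem 0`) needs traineddata that still contains the legacy model. The helper names `tesseract_config` and `keep_allowed` are hypothetical.

```python
def tesseract_config(whitelist=None, legacy_engine=False, psm=6):
    """Build a config string to pass to Tesseract (e.g. via pytesseract)."""
    parts = ["--oem", "0" if legacy_engine else "1", "--psm", str(psm)]
    if whitelist:
        parts += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return " ".join(parts)


def keep_allowed(text, whitelist):
    """Fallback: drop characters outside the whitelist after OCR has run."""
    allowed = set(whitelist) | {" ", "\n"}
    return "".join(ch for ch in text if ch in allowed)


if __name__ == "__main__":
    config = tesseract_config("0123456789.", legacy_engine=True)
    print(config)
    # With pytesseract installed, the string would be used roughly like:
    #   pytesseract.image_to_string(image, config=config)
    print(repr(keep_allowed("Total: 104.50\n", "0123456789.")))
```

Note that post-filtering is weaker than a real whitelist: it only deletes unwanted characters and cannot steer the recognizer toward an allowed one.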

------
udayrddy
"Let's assume you've created an OCR model to detect Name, Address, DOB from
Drivers Licenses. Since there are 3 categories in this model, each API call
will be priced at $0.01 * 3 = $0.03/image. So if you're on the Medium plan,
you'll get 99/0.03 = 3300 API calls."

Whoa!! That is insanely expensive.
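The arithmetic in the quote can be checked with a few lines. The per-field price, the three fields and the 99 figure are taken from the quoted text; reading 99 as the plan's dollar budget is an assumption, and `calls_for_budget` is a hypothetical helper. `Decimal` is used to avoid floating-point rounding on the division.

```python
from decimal import Decimal


def calls_for_budget(budget, fields, price_per_field="0.01"):
    """Return (cost per image, number of whole API calls the budget covers)."""
    cost_per_image = Decimal(price_per_field) * fields
    return cost_per_image, int(Decimal(budget) / cost_per_image)


if __name__ == "__main__":
    cost, calls = calls_for_budget("99", fields=3)
    print(f"${cost} per image, {calls} calls")
```

The same function shows how quickly the cost scales with the number of extracted fields, since the price is multiplied per category.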

------
ngcc_hk
Compatible with TensorFlow.js?

------
Aaargh20318
The title is a bit misleading (and doesn't match the linked article). This
isn't about building an OCR engine, it's about _using_ an existing one.

~~~
dang
Yes. We've changed the title to that of the article. From the site guidelines:
"_Please use the original title, unless it is misleading or linkbait; don't
editorialize._"

Submitted title was "Building an OCR Engine with Python and Tesseract", which
broke that guideline, assuming the page title didn't change.

