
Ask HN: How Do I Get Started with OCR (Optical Character Recognition)? - muralimadhu
I have no background in machine learning or computer vision. What I do have is a problem statement: I want to be able to parse and get structured text out of financial documents like W-2s and paystubs. For example, parse out the company name, salary, etc. from a W-2. Off-the-shelf solutions like AWS Textract don't work very well. So far I have only been treating OCR as a black box. If I were to build an OCR service myself for a specialized set of financial documents, what theory and tools would I need to learn, assuming I have a CS background but not an ML background? Thanks in advance
======
ocrcustomserver
What is the reason that you want to roll your own?

Is it because you want to own the IP or for learning purposes?

------
staticautomatic
I would continue searching for an off-the-shelf solution, particularly if
you're able to throw a small amount of money at the problem. DocParser is
probably what I'd try first.

How complicated it is to roll your own really depends on how much variation
there is in your documents (in a variety of ways, from "What size, shape, and
resolution is the image?" to "Is what you're trying to extract always in the
same relative location?").

~~~
muralimadhu
The data is not always in the same location, and the images/documents are user
submitted, so no guarantees about resolution. I'll continue to look for off-
the-shelf solutions. If I were to invest in doing this myself, what would you
recommend? Are there any books/courses that'll help me with the foundations?

~~~
staticautomatic
I strongly caution you against doing it yourself, but if you're determined,
here are some things you'll need to deal with at a high level.

1) Image standardization: All user-submitted images should be the same
resolution and the same size on at least one axis.

2) Pre-processing: Rotation, skew, and binarization, mainly. You could do this
yourself or let an OCR API handle it for you (many do, both local and cloud).

Then there are basically two paths in a rudimentary solution. You could do
full-page OCR, store the coordinates of extracted n-grams, and then inspect the
area where you think or know the text you want to extract will be. Or, you
could crop the image to a rectangle where you think or know the text will be,
OCR the rectangle, and see if the text is there. The latter is computationally
cheaper, but the rectangle isn't guaranteed to contain the text of interest
(assuming it's there at all), and you won't easily be able to define fields in
relation to each other.
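To make the first path concrete, here's a minimal sketch. The word list mimics the per-word bounding boxes you'd get from something like pytesseract's image_to_data(); the coordinates and text are made up for illustration:

```python
# Each OCR'd word comes back with its bounding box; here modeled as dicts.
words = [
    {"text": "Wages", "left": 120, "top": 310, "width": 60, "height": 14},
    {"text": "52,340.00", "left": 130, "top": 340, "width": 90, "height": 14},
]

def words_in_region(words, x, y, w, h):
    """Full-page OCR is already done; inspect the region where the
    field is expected, keeping words whose box centers fall inside it."""
    def inside(wd):
        cx = wd["left"] + wd["width"] / 2
        cy = wd["top"] + wd["height"] / 2
        return x <= cx <= x + w and y <= cy <= y + h
    return [wd["text"] for wd in words if inside(wd)]
```

The second path is the same idea in reverse: crop first, then OCR only the crop.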

The top commercial products are a lot more complicated than this, though. A
"template" for extracting Box 1 from a W-2 would do something like this:

- Crop the document to just the top third and OCR it

- Match the string "1 Wages, tips, or other compensation" which itself is
something like "Try to find that exact string. No? Look for a 1 or something
that resembles a 1. Is there an n-gram to the right of it beginning with the
word Wages? No? OK how about wages in lower case. No? OK restrict the pattern
match to just 1 line. Does that work? OK good. Does something like the word
'tips' appear in the n-gram? Ok that's probably it."

- Underneath that line, offset x to the right, look for a string of digits
matching this regex.
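The exact regex isn't specified above, but for a US dollar amount it would be something in this vein (a guess at the kind of pattern meant, not the actual one a commercial product uses):

```python
import re

# Thousands separators optional, cents required, so a stray box
# number like "1" at the start of the line doesn't match.
AMOUNT = re.compile(r"\d{1,3}(?:,\d{3})*\.\d{2}")

def extract_amount(line: str):
    m = AMOUNT.search(line)
    return m.group() if m else None
```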

The string matching is done probabilistically in branches based on the
specified rules (like "there's this n-gram over here that seems like what
you're looking for but you said it should be in this other location, so we'll
mark that as a possibility but it's probably not what you're looking for") and
with the assistance of a really powerful dictionary.
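As a toy illustration of that kind of tolerant matching, the standard library's difflib gets you surprisingly far. This is a crude single-branch stand-in for the probabilistic branching described above, not how the commercial products actually do it:

```python
import difflib

def best_label_match(ocr_tokens, target="Wages", cutoff=0.6):
    """Score each OCR'd token against the label we expect, tolerating
    common misreads like 'VVages' or stray punctuation. Returns the
    closest token above the cutoff, or None."""
    matches = difflib.get_close_matches(target, ocr_tokens, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```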

If you don't want to totally roll your own and you happen to write Java, check
out OpenKM, which I believe has some of the necessary abstractions built in
for zonal OCR.

~~~
muralimadhu
Thanks for the detailed response

