Hi HN! I've spent a couple of months fiddling with OCR and wanted to share some of my findings.
The approach I share here (fine-tuning recent deep learning models) is the first that's gotten me anything resembling high-quality OCR on these particular noisy historical documents. OCRing them has been something of a white whale for me for several years (though a white whale I've spent comparatively little time chasing).
At this point I think I'm reasonably competent at OCR, but no expert... Curious to hear your thoughts.
Yeah, I think MS's is the best out there, but agree that the usability leaves something to be desired. Two thoughts:
1. I believe the IR jargon for extracting a JSON of this form is Key Information Extraction (KIE). MS has an out-of-the-box model for it. I just tried it on the screenshot and it did a pretty good (though not perfect) job: it missed a few form fields, but got most. MS sort of has a flow for fine-tuning, but it leaves a lot to be desired IMO. Curious whether this would be "good enough" to satisfy the use case.
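To make the KIE idea concrete, here's a toy sketch of the kind of JSON it produces, using a naive nearest-word-to-the-right heuristic over OCR word boxes. This is purely illustrative, not the MS model or API; all names here are made up.

```python
# Toy KIE: pair each form-field label with the nearest OCR word to its
# right on the same line. Real KIE models learn this mapping instead.

def extract_fields(words, labels):
    """words: list of (text, x, y) OCR word anchors; labels: field names to find."""
    result = {}
    for label in labels:
        anchor = next((w for w in words if w[0] == label), None)
        if anchor is None:
            continue  # field not found on the page
        # Candidate values: words on the same line, to the right, not labels.
        candidates = [w for w in words
                      if w[2] == anchor[2] and w[1] > anchor[1] and w[0] not in labels]
        if candidates:
            result[label] = min(candidates, key=lambda w: w[1] - anchor[1])[0]
    return result

words = [("Name:", 0, 0), ("Ada", 60, 0), ("Total:", 0, 20), ("42.50", 60, 20)]
print(extract_fields(words, ["Name:", "Total:"]))
# → {'Name:': 'Ada', 'Total:': '42.50'}
```

A learned model handles multi-line values, skewed layouts, and fields with no printed label, which is exactly where a heuristic like this falls over.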
2. In terms of just OCR (i.e. getting the text/numeric strings correct), MS is known to be the best on typed text at the moment [1]. Handwriting is a different beast... but it looks like MS is doing a very good job there (and SOTA on handwriting is very good). In particular, it got all the numbers in that screenshot correct.
Figures, too! Yeah, you could write some logic on top of a library like this and tune it to trade off recall (grab more surrounding context) against precision (only the direct context around the word, e.g. the containing paragraph or the 5 surrounding table rows) for your specific application's needs.
Using the models underlying a library like this, there's maybe room for fine-tuning as well, if you have a set of documents with specific semantic boundaries that current approaches don't capture (and you're willing to spend an hour drawing bounding boxes to make that happen).
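The recall/precision knob described above can be sketched in a few lines: given ordered layout blocks (paragraphs, table rows) from whatever layout library you use, a window size of 0 returns just the containing block (precision), while a larger window pulls in neighbors (recall). The function and data here are hypothetical, just to show the shape of the logic.

```python
# Return context around a matched word, given ordered layout blocks.
# window=0 favors precision (only the containing block);
# larger windows favor recall (more surrounding context).

def context_for_match(blocks, match_index, window=0):
    """blocks: ordered list of text blocks; match_index: index of the
    block containing the matched word."""
    lo = max(0, match_index - window)
    hi = min(len(blocks), match_index + window + 1)
    return " ".join(blocks[lo:hi])

blocks = ["Intro paragraph.", "Row 1: 10", "Row 2: 20", "Row 3: 30", "Footer."]
print(context_for_match(blocks, 2, window=0))  # → Row 2: 20
print(context_for_match(blocks, 2, window=1))  # → Row 1: 10 Row 2: 20 Row 3: 30
```

In practice you'd pick the window (and whether to cross table/paragraph boundaries) by evaluating against whatever downstream task consumes the context.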
Funnily enough, this is another great tactic for getting emails returned (looping in someone with more leverage than you or asking them to follow up for you)!
We should talk! I do work on automatically coding products for a shipping survey at the Census Bureau. One of the earliest production uses of ML here at Census :)
(I am no expert in the analytic underpinnings of the beta distribution or precisely how it is the conjugate prior to the binomial -- or, rigorously speaking, what "conjugate prior" even means -- but the formula here lines up with his formula :P )
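For anyone who wants the conjugacy fact spelled out: if the prior on a success probability p is Beta(a, b) and you observe k successes in n binomial trials, the posterior is Beta(a + k, b + n - k). A minimal sketch of that update (not tied to any particular formula in the parent comment):

```python
# Beta-binomial conjugate update: the posterior is again a Beta
# distribution, with the observed successes/failures added to the
# prior's pseudo-counts.

def beta_binomial_update(a, b, k, n):
    """Posterior Beta(a', b') after observing k successes in n trials."""
    return a + k, b + (n - k)

def beta_mean(a, b):
    """Mean of Beta(a, b)."""
    return a / (a + b)

# Uniform prior Beta(1, 1); observe 7 successes in 10 trials.
a_post, b_post = beta_binomial_update(1, 1, 7, 10)
print((a_post, b_post))           # → (8, 4)
print(beta_mean(a_post, b_post))  # posterior mean 8/12 ≈ 0.667
```

"Conjugate" just means the posterior stays in the same family as the prior, which is why the update is pure arithmetic on the parameters.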