Hacker News | c_moscardi's comments

Came to post this — it’s the same underlying technology, just a lot more compute now.

Hi HN! I've spent a couple of months fiddling with OCR and wanted to share some of my findings.

The approach I share here (fine-tuning recent deep learning models) is the first one that's gotten me anything resembling high-quality OCR on these particular noisy historical documents. OCRing these has been something of a white whale for me for several years (albeit a white whale I have spent comparatively little time on).

At this point I think I am reasonably competent in OCR, but no expert... Curious for your thoughts.


Related reading: these posts explain the same concept quite well, IMO, using NYC subway data. This is where I first learned about it.

[1] https://erikbern.com/2016/04/04/nyc-subway-math

[2] https://erikbern.com/2016/07/09/waiting-time-math.html


Yeah, I think MS's is the best out there, but I agree that the usability leaves something to be desired. Two thoughts:

1. I believe the IR jargon for getting a JSON of this form is Key Information Extraction (KIE). MS has an out-of-the-box model for this. I just tried it on the screenshot and it did a pretty good (but not perfect) job: it got most of the form fields, though not every one. MS sort of has a flow for fine-tuning, but it's pretty rough IMO. Curious if this would be "good enough" to satisfy the use case.

2. In terms of just OCR (i.e. getting the text/numeric strings correct), MS is known to be the best on typed text at the moment [1]. Handwriting is a different beast... but it looks like MS is doing a very good job there (and SOTA on handwriting is very good). In particular, it got all the numbers in that screenshot correct.

If you want to see the results from MS on the screenshot in this blog post, here's the entire JSON blob. A bit of a behemoth but the key/value stuff is in there: https://gist.github.com/cmoscardi/8c376094181451a49f0c62406e...

[1] https://mindee.github.io/doctr/latest/using_doctr/using_mode...


That does look pretty great, thanks for the tip.

Sending images through that API and then using an LLM to extract data from the text result from the OCR could be worth exploring.


Figures, too! Yeah, you could write some logic on top of a library like this and tune it for your specific application, trading off some notion of recall (grab more surrounding context) against precision (only the direct context around the word, e.g. just the paragraph or the 5 surrounding table rows).

Using the models underlying a library like this, there's maybe room for fine-tuning as well if you have a set of documents with specific semantic boundaries that current approaches don't capture (and you're willing to spend an hour drawing bounding boxes to make that happen).
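
To make the recall/precision knob concrete, here's a minimal sketch. It assumes the document has already been parsed into an ordered list of blocks (paragraphs, table rows). The `Block` class and `context_for` function are hypothetical names made up for illustration, not part of any particular library.

```python
from dataclasses import dataclass


@dataclass
class Block:
    kind: str  # hypothetical label, e.g. "paragraph" or "table_row"
    text: str


def context_for(blocks, term, window=0):
    """Return the blocks within `window` positions of any block containing `term`.

    window=0 keeps only the matching blocks (favoring precision);
    larger windows pull in more surrounding context (favoring recall).
    """
    needle = term.lower()
    hits = [i for i, b in enumerate(blocks) if needle in b.text.lower()]
    keep = set()
    for i in hits:
        start = max(0, i - window)
        stop = min(len(blocks), i + window + 1)
        keep.update(range(start, stop))
    # Preserve document order in the result.
    return [blocks[i] for i in sorted(keep)]
```

You'd then pick `window` (or a smarter rule per block kind) by checking results against a handful of documents from your own application.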


Funnily enough, this is another great tactic for getting emails returned (looping in someone with more leverage than you or asking them to follow up for you)!


We should talk! I do work on automatically coding products for a shipping survey at the Census Bureau. One of the earliest production uses of ML here at Census :)

5 minute deck: https://github.com/codingitforward/cdfdemoday2018/blob/maste...

Feel free to shoot me a message.


Yep, my read too. The code he provides is definitely the Clopper-Pearson formula. [1]

This confused me as well, since I've always used the normal-approximation (central-limit-based) confidence interval for binary outcomes.

[1] https://en.wikipedia.org/wiki/Binomial_proportion_confidence...

(I am no expert in the analytic underpinnings of the beta distribution or precisely how it is the conjugate prior to the binomial -- or, rigorously speaking, what conjugate prior means -- but the formula here lines up with his formula :P )
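
For anyone who wants to compare the two intervals side by side, here's a rough sketch in plain Python. This is not the code from the post. It computes Clopper-Pearson by numerically inverting the binomial tail probabilities with bisection (equivalent to the beta-quantile form, but needing only the standard library), next to the usual normal-approximation interval. The function names are my own.

```python
import math
from statistics import NormalDist


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _root_increasing(h, tol=1e-12):
    """Bisection root of an increasing function h on [0, 1] with h(0) < 0 < h(1)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if h(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) interval for k successes in n trials."""
    # Lower bound: the p at which P(X >= k) equals alpha/2.
    lower = 0.0 if k == 0 else _root_increasing(
        lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2
    )
    # Upper bound: the p at which P(X <= k) equals alpha/2.
    upper = 1.0 if k == n else _root_increasing(
        lambda p: alpha / 2 - binom_cdf(k, n, p)
    )
    return lower, upper


def wald_interval(k, n, alpha=0.05):
    """Normal-approximation interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = k / n
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half
```

The difference shows up most at the extremes: with 0 successes in 10 trials, the normal-approximation interval collapses to the single point 0, while Clopper-Pearson still gives an upper bound of roughly 0.31.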


I feel like they're learning from the mistakes of their forebears and getting out before the getting's truly bad...


Author here! I'll keep an eye on this thread and would love to discuss anything in the post that's of interest to you.


I really like the theme on your blog.

