
There's a brute-force solution to the "extract text from a 'digital-native' image" problem that you can write in an afternoon:

1. Use an existing OCR library to give you the positions of the words, plus a first-cut guess of their content.

2. Take the first word from the OCRed guess, and loop through a set of {font, size, leading} tuples, rendering the same word at each tuple's settings, overlaying it on the image, and measuring the error-distance.

3. If your best match isn't within some minimum error-distance, then assume that the OCR misrecognized the first word, and try again with the second, third, etc.

Once you've got a font-settings match:

4. render the rest of the words onto their respective detected bounding boxes;

5. notice which words have a higher error-distance than the rest;

6. for each word, generate candidate mutations of the word (e.g. everything at a Levenshtein distance of 1 from the OCRed guess), pick the one that lowers the error-distance, and repeat until the distance for that word won't decrease any further.

7. Return the error-minimized set of words.
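Steps 2 and 6 above can be sketched in a few dozen lines of Python. This is a toy, assuming a `render(word, settings)` callable exists (e.g. built on Pillow's `ImageFont`/`ImageDraw`) that rasterises a word into a binary raster the same size as the cropped word image; the names `pixel_error`, `best_font_settings`, `edits1`, and `refine` are all hypothetical:

```python
import string

def pixel_error(a, b):
    """Fraction of mismatched pixels between two equal-size binary
    rasters (lists of rows of 0/1); a stand-in for a real error metric."""
    total = sum(len(row) for row in a)
    diff = sum(x != y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return diff / total

def best_font_settings(word, crop, settings, render):
    """Step 2: try every {font, size, leading} tuple, rendering `word`
    with each one and overlaying it on the cropped word image `crop`."""
    return min(settings, key=lambda s: pixel_error(render(word, s), crop))

ALPHABET = string.ascii_lowercase

def edits1(word):
    """Step 6: every candidate mutation at Levenshtein distance 1."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    substitutes = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    return deletes | substitutes | inserts

def refine(word, error_distance):
    """Greedily accept the distance-1 mutation with the lowest error,
    stopping once no candidate scores lower than the current word."""
    best, best_err = word, error_distance(word)
    while True:
        cand = min(edits1(best), key=error_distance)
        cand_err = error_distance(cand)
        if cand_err >= best_err:
            return best
        best, best_err = cand, cand_err
```

In the real pipeline, `error_distance` for `refine` would be the pixel error of re-rendering the mutated word into the word's detected bounding box, so the hill-climb walks toward whatever string best reproduces the pixels.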

You could call this a form of https://en.wikipedia.org/wiki/Code-excited_linear_prediction, with fonts as the pre-trained models.


Actually, come to think of it, it'd be a lot easier to detect and unify "identical" sub-regions of the image first (using e.g. https://en.wikipedia.org/wiki/JBIG2 on a lossless setting). Then you could, in parallel to the above, also try to do frequency-analysis to discover which of your image "tiles" would likely form a basic "alphabet" of character-glyphs—and then hill-climb toward aligning that "alphabet" by attempting to produce the most runs of character-glyphs that translate to known dictionary words in whatever language the OCR thinks the text is in.
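A toy sketch of that frequency-analysis idea, with one loud simplification: real JBIG2 symbol extraction segments connected components, whereas this cuts the image into a fixed grid of tiles. The name `tile_frequencies` is hypothetical:

```python
from collections import Counter

def tile_frequencies(image, tile_w, tile_h):
    """Cut a binary image (list of rows of 0/1) into fixed-size tiles,
    unify identical tiles, and rank them by frequency; the most frequent
    tiles are candidates for the character-glyph "alphabet"."""
    h, w = len(image), len(image[0])
    counts = Counter()
    for y in range(0, h - tile_h + 1, tile_h):
        for x in range(0, w - tile_w + 1, tile_w):
            tile = tuple(tuple(image[y + dy][x + dx] for dx in range(tile_w))
                         for dy in range(tile_h))
            counts[tile] += 1
    return counts.most_common()
```

The hill-climbing step would then try assignments of glyphs to the top-ranked tiles, scoring each assignment by how many runs of tiles decode to dictionary words.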

The font-matching would still be necessary, though, for the rest of the image samples that don't fall into the easily-frequency-analyzed part. (And for languages that aren't alphabetic, like Chinese, where there are no super-common character-glyphs.)

A partner and I came up with a similar solution. It hinged on detecting the typeface and using a bitmapped (or otherwise rendered) font package to OCR letter by letter.

The PDF files that we are dealing with do not have embedded text and are not searchable, but are "digital-native," to use the term that you suggested.

Does this not exist? If not, why does it not exist?!