PDF to Text, Done Right (pdfclean.dev)
4 points by dubbid on Sept 19, 2023 | 6 comments


So, I don't think the first two parts (converting the PDF page to an image and running OCR on it to get the text) are necessary. One could instead use the basic information in the PDF content stream to get the bounding box of each character. The resulting information can still be analyzed with the layout analysis algorithm mentioned as step 3.
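
To make that concrete, here's a rough sketch of the content-stream approach using pdfminer.six (the library choice is mine, just for illustration, not from the post):

    # Sketch: character bounding boxes straight from the PDF,
    # no rasterization or OCR involved. Library choice is illustrative.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer

    def char_boxes(path):
        """Yield (character, bounding box) pairs for every glyph."""
        for page_layout in extract_pages(path):
            for element in page_layout:
                if isinstance(element, LTTextContainer):
                    for line in element:
                        for obj in line:
                            if isinstance(obj, LTChar):
                                # bbox is (x0, y0, x1, y1) in PDF points
                                yield obj.get_text(), obj.bbox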

The information would also be more exact, since character positions extracted from an image depend on how the PDF was rendered (e.g. whether an A4 page is rasterized at 300 ppi, 600 ppi, or higher).
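
For example, the same point on the page lands on different pixel coordinates at different render resolutions, and only the 72-points-per-inch conversion ties them back together:

    # Hypothetical helper: map an OCR pixel coordinate back to PDF points
    # (1 pt = 1/72 inch). Extracted boxes are only as exact as this mapping.
    def px_to_pt(px: float, dpi: int) -> float:
        return px * 72.0 / dpi

    px_to_pt(1200, 300)  # 288.0 pt
    px_to_pt(2400, 600)  # 288.0 pt, same point, different pixel values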


The idea is to handle scanned documents as well. Besides, text boxes can sometimes get so distorted with whitespace that they look very different to a computer than they do in real life.

You are right that this would be more efficient in many cases (not scanned, no weird whitespace). But in practice, the cost of OCR is so low compared to the LLM costs, and the relative consistency of OCR output helps so much, that I don't try to handle PDF object extraction at all.


Fair point :) And yes, some PDFs use weird ways to represent the spacing between words (e.g. negative kerning offsets in TJ arrays instead of actual space characters).


An approach based on optical character recognition and computer vision, better than anything else available commercially or as open source.


This is just a wrapper around pytesseract...


Not quite! This goes one step further to serialize the text for complex layouts in a sensible manner, using a geometry-based approach as specified in the blog post :)
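
To sketch what geometry-based serialization can look like on top of pytesseract's word boxes (this is a toy version of the general idea, not pdfclean's actual algorithm; row_height is a made-up threshold):

    # Crude geometry-based reading order: bucket OCR word boxes into rows
    # by their top coordinate, then read each row left to right.
    import pytesseract
    from PIL import Image

    def serialize(image_path, row_height=20):
        data = pytesseract.image_to_data(
            Image.open(image_path), output_type=pytesseract.Output.DICT)
        words = [
            (data["top"][i] // row_height, data["left"][i], data["text"][i])
            for i in range(len(data["text"]))
            if data["text"][i].strip()
        ]
        rows = {}
        for row, left, text in sorted(words):  # row-major order
            rows.setdefault(row, []).append(text)
        return "\n".join(" ".join(r) for r in rows.values())

A real implementation would need to handle multi-column layouts, tables, and varying line heights, which is presumably where the blog post's algorithm earns its keep.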




