So, I don't think the first two parts (rendering the PDF page to an image and OCRing it to get the text) are necessary. One could just use the basic information in the PDF content stream to get the bounding box for each character. The resulting information can then still be analyzed with the layout analysis algorithm mentioned as step 3.
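To illustrate, this is only a few lines with a library like pdfminer.six (just one possible library choice, and the file name below is a placeholder):

```python
# Sketch: per-character bounding boxes straight from the PDF content
# stream, here via pdfminer.six (one possible library choice).
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextContainer

for page_layout in extract_pages("document.pdf"):  # placeholder path
    for element in page_layout:
        if not isinstance(element, LTTextContainer):
            continue
        for text_line in element:
            for obj in text_line:
                # LTChar carries the glyph and its bbox in PDF points;
                # LTAnno objects (inferred spaces/newlines) have no bbox.
                if isinstance(obj, LTChar):
                    print(obj.get_text(), obj.bbox)
```

The bboxes come back in PDF user space (points), so they don't depend on any render resolution.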
The information would also be more exact, since character positions extracted from an image depend on how the PDF was rendered to that image (e.g. whether an A4 page is rasterized at 300ppi, 600ppi, or higher).
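To make the ppi point concrete: PDF user space is fixed at 72 points per inch, so OCR pixel coordinates have to be divided back by the render scale, and any rounding happens at that resolution. A minimal sketch (the numbers are made up):

```python
def px_to_points(bbox_px, dpi):
    # PDF user space is 72 points per inch, so the render scale
    # is dpi / 72; dividing maps pixel coordinates back to points.
    scale = dpi / 72.0
    return tuple(v / scale for v in bbox_px)

# The same physical box detected at two render resolutions maps
# back to (roughly) the same coordinates in points:
print(px_to_points((417, 417, 834, 834), 300))    # ~(100, 100, 200, 200)
print(px_to_points((834, 834, 1668, 1668), 600))  # ~(100, 100, 200, 200)
```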
The idea is to generally handle scanned documents as well. Besides, text boxes can sometimes get so distorted by whitespace that they look very different to a computer than they do in real life.
You are right that this would be more efficient in many cases (not scanned, no weird whitespace), but in practice the cost of OCR is so low compared to the LLM costs, and the relative consistency of OCR output helps so much, that I don't try to handle PDF object extraction.
Not quite! This goes one step further to serialize the text for complex layouts in a sensible manner, using a geometry-based approach as specified in the blog post :)