PDF to Text, Done Right (pdfclean.dev)
4 points by dubbid on Sept 19, 2023 | 6 comments


So, I don't think the first two parts (converting the PDF page to an image and running OCR on it to get the text) are necessary. One could instead use the basic information in the PDF content stream to get the bounding box of each character. The resulting information can still be analyzed with the layout analysis algorithm mentioned as step 3.
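
To make that concrete, here's a rough sketch of the content-stream approach using pdfminer.six (the library choice is mine, just for illustration, not from the post):

    # Sketch: character bounding boxes straight from the PDF,
    # no rasterization or OCR involved. Library choice is illustrative.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer

    def char_boxes(path):
        """Yield (character, bounding box) pairs for every glyph."""
        for page_layout in extract_pages(path):
            for element in page_layout:
                if isinstance(element, LTTextContainer):
                    for line in element:
                        for obj in line:
                            if isinstance(obj, LTChar):
                                # bbox is (x0, y0, x1, y1) in PDF points
                                yield obj.get_text(), obj.bbox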

The information would also be more exact, since character positions extracted from an image depend on how the PDF was rendered (e.g. whether an A4 page is rasterized at 300 ppi, 600 ppi, or higher).
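
For example, the same point on the page lands on different pixel coordinates at different render resolutions, and only the 72-points-per-inch conversion ties them back together:

    # Hypothetical helper: map an OCR pixel coordinate back to PDF points
    # (1 pt = 1/72 inch). Extracted boxes are only as exact as this mapping.
    def px_to_pt(px: float, dpi: int) -> float:
        return px * 72.0 / dpi

    px_to_pt(1200, 300)  # 288.0 pt
    px_to_pt(2400, 600)  # 288.0 pt, same point, different pixel values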


The idea is to handle scanned documents as well. Besides, text boxes can sometimes get so distorted with whitespace that they look very different to a computer than they do in real life.

You are right that this would be more efficient in many cases (not scanned, no weird whitespace). But in practice, the cost of OCR is so low compared to the LLM costs, and the relative consistency of OCR output helps so much, that I don't try to handle PDF object extraction at all.


Fair point :) And yes, some PDFs use weird ways to represent the spacing between words (e.g. negative kerning offsets in TJ arrays instead of actual space characters).


An approach based on optical character recognition and computer vision, better than anything else available commercially or as open source.


This is just a wrapper around pytesseract...


Not quite! This goes one step further to serialize the text for complex layouts in a sensible manner, using a geometry-based approach as specified in the blog post :)
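
To sketch what geometry-based serialization can look like on top of pytesseract's word boxes (this is a toy version of the general idea, not pdfclean's actual algorithm; row_height is a made-up threshold):

    # Crude geometry-based reading order: bucket OCR word boxes into rows
    # by their top coordinate, then read each row left to right.
    import pytesseract
    from PIL import Image

    def serialize(image_path, row_height=20):
        data = pytesseract.image_to_data(
            Image.open(image_path), output_type=pytesseract.Output.DICT)
        words = [
            (data["top"][i] // row_height, data["left"][i], data["text"][i])
            for i in range(len(data["text"]))
            if data["text"][i].strip()
        ]
        rows = {}
        for row, left, text in sorted(words):  # row-major order
            rows.setdefault(row, []).append(text)
        return "\n".join(" ".join(r) for r in rows.values())

A real implementation would need to handle multi-column layouts, tables, and varying line heights, which is presumably where the blog post's algorithm earns its keep.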




