Show HN: `pdf2searchablepdf` command-line tool to make PDF have searchable text

ydant · on Aug 14, 2022

Thank you for sharing!

Another similar project is PDF Sandwich.

I'd like to find one that uses AWS Textract or some other similar OCR service. For my rare use case, it's so much better at OCR that it's worth paying a few cents per job. Maybe Tesseract /can/ do a better job than I've seen in the past, but I imagine getting it there would take an investment of time. Usually I'm OCRing one or two PDFs as part of a separate workflow, so I don't want to get distracted by trying to improve the process.

techie2022 · on Aug 14, 2022

I'll add PDF Sandwich to my readme here: https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF#a...

Issue opened: https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF/i...

Can you identify any pros/cons to my software over theirs, or vice versa?

techie2022 · on Aug 14, 2022

I opened an issue for this: https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF/i...

noodlesUK · on Aug 14, 2022

See also: https://ocrmypdf.readthedocs.io/en/

hudvin · on Aug 18, 2022

Doesn't work very well. First of all, there are several separate problems: 1) detecting whether OCR is required, 2) image optimization, 3) preprocessing of broken PDF files. And none of them is easy.

1) A page can contain selectable text that still can't be copied, because the embedded font doesn't contain a glyph-to-character-code mapping. The mapping table can also contain complete garbage. Sometimes a page contains long URLs (added by email services) as real text while all the other text is provided as an image. Sometimes a page mixes normal text with garbage. And there are many, many other cases.

2) Some old scanners generate PDF documents built from 2-5 pixel image stripes. Some of them try to do OCR (poorly). Some of them use a huge DPI. Sometimes you get an uncompressed document in which each page can take up to 200 MB. So you need to convert each PDF page to an image, but you have to choose a format and compression options. PNG is OK, but you have to choose the correct options (for Ghostscript), and the output image will be huge. JPG is better for size, but quality can be low. Sometimes multistage optimization is required. Also, tools like Ghostscript, fitz, or ImageMagick don't handle every possible PDF/image.

3) Weird PDFs are an endless story: poor fonts, broken fonts, very specific corner cases in the PDF standard, issues with image extraction, tables of contents, viruses, embedded files, annotations, margins/paddings/rotations/translations.
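To make point 1 a bit more concrete, here is a rough sketch of a "does this page need OCR?" check. It assumes you already have the page's text layer as a string from some extractor (for example `pdftotext` or PyMuPDF's `page.get_text()`). The function name and both thresholds are made up for illustration and would need tuning. It only catches the easy cases (no text layer, or a layer full of unmapped characters); a mapping table that produces plausible-looking but wrong letters will pass straight through.

```python
import unicodedata


def looks_like_real_text(text: str, min_chars: int = 50,
                         min_good_ratio: float = 0.8) -> bool:
    """Guess whether an extracted text layer is usable (hypothetical heuristic).

    Returns False when the page probably needs OCR: either there is almost
    no text, or too much of it is replacement/private-use/control characters.
    """
    # Drop all whitespace so layout does not affect the ratio.
    stripped = "".join(text.split())
    if len(stripped) < min_chars:
        # Too little text: likely a scanned image, or only a stray URL/footer.
        return False

    good = 0
    for ch in stripped:
        if ch == "\ufffd":
            # U+FFFD is what many extractors emit for unmapped glyphs.
            continue
        # Count letters (L*), numbers (N*) and punctuation (P*) as plausible.
        if unicodedata.category(ch)[0] in ("L", "N", "P"):
            good += 1

    return good / len(stripped) >= min_good_ratio
```

You would call this per page and only send pages that return False through the OCR step, rather than deciding once for the whole document.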

I should probably write a long post about this )
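For the page-to-image step in point 2, this is a minimal sketch of how the PNG-versus-JPEG choice maps onto Ghostscript options. The helper names, the default DPI, and the default JPEG quality are assumptions for illustration, not recommendations; the Ghostscript switches themselves (`-sDEVICE`, `-r`, `-dJPEGQ`, `-sOutputFile`) are standard ones.

```python
import subprocess


def ghostscript_render_command(pdf_path, out_pattern, dpi=300,
                               lossless=True, jpeg_quality=85):
    """Build a Ghostscript command that renders every page to an image.

    out_pattern is a Ghostscript output template such as "page-%03d.png".
    lossless=True uses 24-bit PNG (large files); False uses JPEG (smaller,
    but lossy, so small text may degrade).
    """
    device = "png16m" if lossless else "jpeg"
    cmd = [
        "gs", "-dSAFER", "-dBATCH", "-dNOPAUSE", "-dQUIET",
        f"-sDEVICE={device}",
        f"-r{dpi}",  # render resolution in DPI
    ]
    if not lossless:
        cmd.append(f"-dJPEGQ={jpeg_quality}")  # JPEG quality, 0-100
    cmd += [f"-sOutputFile={out_pattern}", str(pdf_path)]
    return cmd


def render_pages(pdf_path, out_pattern, **options):
    """Run Ghostscript; raises CalledProcessError if gs fails on the file."""
    subprocess.run(
        ghostscript_render_command(pdf_path, out_pattern, **options),
        check=True,
    )
```

Keeping the command builder separate from the call makes it easy to retry the same file with different settings (for example, lower DPI or JPEG) when the first pass produces a huge image.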

noodlesUK · on Aug 14, 2022

My bad: I was going to link deeper into the docs but I changed my mind… guess I mangled the link.

saurik · on Aug 14, 2022

> SORRY

> This page does not exist yet.

jwilk · on Aug 14, 2022

http://ocrmypdf.readthedocs.io/ works.

techie2022 · on Aug 14, 2022

I've got a link to ocrmypdf in my "Alternative Software" section in my readme here: https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF#a...

Are there any advantages you see of mine vs theirs? I'd like to know what features each has compared to the other.