Hacker News
OCRmyPDF can also OCR .png and output .txt and .pdf
2 points by Cognotes 5 months ago
To deal with bad PDFs, I split each PDF into single-page .png files with Ghostscript, then run ocrmypdf on those .png files, which outputs each one as a PDF with a text layer. With the --sidecar option it also outputs a .txt file. I then concatenate all the single-page PDFs and dump the .txt files into a database for better searching.
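
For anyone who wants to try this, here is a rough Python sketch of the workflow described above. It is not the poster's actual script. The file naming, the 300 DPI setting, merging with Ghostscript's pdfwrite device, and the SQLite table are all assumptions chosen for illustration; the post does not say which merge tool or database was used. It assumes `gs` and `ocrmypdf` are on the PATH.

```python
import sqlite3
import subprocess
from pathlib import Path


def split_cmd(pdf: Path, out_dir: Path, dpi: int = 300) -> list[str]:
    """Ghostscript command that renders each page of `pdf` to its own PNG."""
    return [
        "gs", "-dNOPAUSE", "-dBATCH", "-sDEVICE=png16m", f"-r{dpi}",
        f"-sOutputFile={out_dir / 'page-%04d.png'}", str(pdf),
    ]


def ocr_cmd(png: Path, dpi: int = 300) -> list[str]:
    """ocrmypdf command: PNG in, searchable PDF out, plus a .txt sidecar."""
    return [
        "ocrmypdf",
        "--image-dpi", str(dpi),  # assumption: match the rendering DPI
        "--sidecar", str(png.with_suffix(".txt")),
        str(png), str(png.with_suffix(".pdf")),
    ]


def merge_cmd(pages: list[Path], merged: Path) -> list[str]:
    """Assumed merge step: concatenate page PDFs with Ghostscript pdfwrite."""
    return ["gs", "-dNOPAUSE", "-dBATCH", "-sDEVICE=pdfwrite",
            f"-sOutputFile={merged}", *map(str, pages)]


def store_text(db: sqlite3.Connection, source: str, page: int, text: str) -> None:
    """Store one page of OCR text; the `pages` table layout is hypothetical."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS pages (source TEXT, page INTEGER, body TEXT)"
    )
    db.execute("INSERT INTO pages VALUES (?, ?, ?)", (source, page, text))
    db.commit()


def run_pipeline(pdf: Path, work: Path, db_path: Path) -> Path:
    """Split, OCR each page, save the sidecar text, then merge the pages."""
    work.mkdir(parents=True, exist_ok=True)
    subprocess.run(split_cmd(pdf, work), check=True)
    pngs = sorted(work.glob("page-*.png"))

    db = sqlite3.connect(db_path)
    for number, png in enumerate(pngs, start=1):
        subprocess.run(ocr_cmd(png), check=True)
        text = png.with_suffix(".txt").read_text(encoding="utf-8")
        store_text(db, pdf.name, number, text)
    db.close()

    merged = work / f"{pdf.stem}-ocr.pdf"
    page_pdfs = [p.with_suffix(".pdf") for p in pngs]
    subprocess.run(merge_cmd(page_pdfs, merged), check=True)
    return merged
```

Calling `run_pipeline(Path("scan.pdf"), Path("work"), Path("ocr.db"))` would then leave a merged PDF in `work/` and one row per page in the database, searchable with something like `SELECT source, page FROM pages WHERE body LIKE '%invoice%'`. The file names here are placeholders.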


