
Debian has a command-line tool, 'pdftotext', which extracts the text from a PDF. It is not OCR; it pulls the characters from the file itself. It's in the package called poppler-utils.
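If it helps, here is a rough Python sketch of how one might check whether a PDF has any extractable text before reaching for OCR. It assumes pdftotext is on the PATH; the function names and the file name in the usage note are made up for illustration, and "any non-whitespace output" is only a crude test for a text layer.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, keep_layout=False):
    """Return the argv list for pdftotext, sending the text to stdout ("-")."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")  # ask pdftotext to keep the physical page layout
    cmd += [pdf_path, "-"]
    return cmd


def has_text_layer(pdf_path):
    """Return True if pdftotext finds any non-whitespace characters."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; on Debian install poppler-utils")
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,
    )
    # Pages are separated by form feeds, which strip() treats as whitespace.
    return bool(result.stdout.strip())
```

Calling has_text_layer("report.pdf") on a hypothetical file returns False when the PDF is image-only, in which case OCR is the remaining option.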



Apologies for not being clear in my OP. The PDF is a digital-native image produced from a text document, but without embedded or searchable text. Looking at the PDF in full resolution, there are no artifacts, blurry characters, or alignment or uneven-scale issues of the kind that cause trouble when attempting to OCR a scan or photograph. It looks exactly like a Word document, but without selectable or editable text.


There are a couple of ways a PDF document could contain actual text that is nevertheless not selectable or searchable. One is that the originator could have protected the document; another (more common) cause is that the originator didn't embed the proper font maps when exporting the document. I see the latter a lot with documents produced from LaTeX originals. As the parent mentioned, pdftotext can often extract text from such documents without the need for OCR. (Sometimes, though, ligatures in the document don't get converted.)
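For the ligature case, a minimal sketch of one possible cleanup step, assuming the ligatures come through in the extracted text as Unicode presentation forms such as U+FB01 ("ﬁ"). If the font maps are missing and the characters come out as garbage or not at all, this will not help. The function name is hypothetical, and note that NFKC normalization also rewrites other compatibility characters (superscripts, full-width forms), which may or may not be what you want.

```python
import unicodedata


def normalize_ligatures(text):
    """Expand Unicode ligature code points (e.g. U+FB01) into plain letters.

    Uses NFKC compatibility normalization, which affects more than ligatures.
    """
    return unicodedata.normalize("NFKC", text)
```

For example, normalize_ligatures("\ufb01nd the \ufb02ow") gives "find the flow".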



