
Why do you use OCR and not PDF to text conversion?

Probably because the PDF is just one big image, if I understand correctly. Otherwise it should just be a matter of copying and pasting from the PDF.

> The PDF files are not scans. They are PDF files created from a Word file.

I'm not sure why he can't just copy and paste.

Apologies. The PDFs that we deal with are digital-native, but do not have embedded text and are not searchable. I simply want to OCR the PDF and spit the text into a Word/text file.

I don't even care about perfect formatting, that's easy to fix. I do care about perfect OCR. That's crucial.

Right. It's an image PDF generated from a text file, so there are no digital-to-analog-to-digital errors introduced. These files should be perfect OCR candidates, but everything that I've found is full of errors, missing portions of sentences, rearranged fragments, etc.
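Since these PDFs start life as Word files, one way to put a number on "full of errors" is to diff an OCR tool's output against the original text for a few sample documents. Below is a minimal sketch of that idea using only Python's standard `difflib`. The function names (`normalize`, `ocr_similarity`, `ocr_differences`) and the sample strings are made up for illustration, and it assumes you can get both the source text and the OCR output as plain strings. It does not run any OCR itself.

```python
import difflib


def normalize(text: str) -> str:
    """Collapse whitespace runs so layout differences are not counted as errors."""
    return " ".join(text.split())


def ocr_similarity(reference: str, ocr_output: str) -> float:
    """Return a 0..1 character-level similarity ratio between source and OCR text."""
    ref, hyp = normalize(reference), normalize(ocr_output)
    return difflib.SequenceMatcher(None, ref, hyp, autojunk=False).ratio()


def ocr_differences(reference: str, ocr_output: str) -> list[tuple[str, str, str]]:
    """List (operation, reference_words, ocr_words) for every non-matching span."""
    ref_words, hyp_words = reference.split(), ocr_output.split()
    matcher = difflib.SequenceMatcher(None, ref_words, hyp_words, autojunk=False)
    return [
        (tag, " ".join(ref_words[i1:i2]), " ".join(hyp_words[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    # Hypothetical sample: one misread word and one dropped word.
    reference = "The quick brown fox jumps over the lazy dog."
    ocr_output = "The quick brovvn fox over the lazy dog."
    print(f"similarity: {ocr_similarity(reference, ocr_output):.3f}")
    for op, ref_span, ocr_span in ocr_differences(reference, ocr_output):
        print(f"{op:8} reference={ref_span!r} ocr={ocr_span!r}")
```

The word-level diff reports misread words as `replace` and dropped sentence portions as `delete`. Rearranged fragments will usually show up as a `delete` in one place paired with an `insert` somewhere else, so this is a rough way to compare tools, not a precise error metric.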
