
I believe the best you could do is extract the raw OCR'd text from the document (with some other tool). The OCR process doesn't preserve formatting or text hierarchy, only the physical location and size of the text on the page. From that raw text, you can convert to Markdown or whatever and then manually edit it to give the OCR text some structure.
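
Since location and size are the only things that survive, one rough way to get a head start on the manual editing is to guess headings from text height. This is just a sketch under assumptions: the `OcrLine` record, the `lines_to_markdown` function, and the 1.6 and 1.25 thresholds are all made up for illustration, and you would have to fill the records from whatever your OCR tool actually exports.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class OcrLine:
    """Hypothetical record for one OCR'd line (not any real tool's format)."""
    text: str
    top: float     # y coordinate of the line's top edge on the page
    height: float  # line height, used here as a proxy for font size


def lines_to_markdown(lines):
    """Guess Markdown headings from relative text height.

    Assumption: most lines are body text, so the median height is
    treated as the body size. The thresholds are arbitrary guesses.
    """
    if not lines:
        return ""
    body_height = median(line.height for line in lines)
    out = []
    # Read the page top to bottom (ignores multi-column layouts).
    for line in sorted(lines, key=lambda l: l.top):
        ratio = line.height / body_height
        if ratio >= 1.6:
            out.append(f"# {line.text}")
        elif ratio >= 1.25:
            out.append(f"## {line.text}")
        else:
            out.append(line.text)
    return "\n\n".join(out)


if __name__ == "__main__":
    page = [
        OcrLine("Intro", top=10, height=30),
        OcrLine("Some body text.", top=50, height=12),
        OcrLine("Details", top=100, height=16),
        OcrLine("More body text.", top=130, height=12),
    ]
    print(lines_to_markdown(page))
```

You would still need to clean up the result by hand, since size alone can't tell a heading from a pull quote or a large caption.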

