Hacker News new | past | comments | ask | show | jobs | submit login

Regarding 2)

Why wouldn't it work for PDF's? If you're able to get the file itself, you should be able to OCR it...

Is there anything obvious that I am missing in regards to PDFs?




I've worked with OCRed PDFs, the main thing that should be obvious is that OCR results range from poor to horrendous. It takes a lot of manual cleanup if a high degree of accuracy is required. Or depending on why you want the text, you can adjust expectations or add layers of software such as fuzzy search algorithms to deal with the issues.

Again depending on the application, the mixed quality of OCR isn't always a deal breaker, but it's not always as simple as it might appear.


It's not the text that's the issue, it's the structure. PDFs have nowhere near as much structure as markup. You end up having to do this for dozens of layouts and it gets hurty really fast:

http://schoolofdata.org/2013/06/18/get-started-with-scraping...


There are computer vision libraries that automatically extract tables from PDFs. For example, http://ieg.ifs.tuwien.ac.at/projects/pdf2table/.

You may want to give that a try if you haven't looked at it before.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: