Regarding 2) Why wouldn't it work for PDF's? If you're able to get the file itse...

natch · on Aug 11, 2014

I've worked with OCRed PDFs, and the main thing to know is that OCR results range from poor to horrendous. It takes a lot of manual cleanup if a high degree of accuracy is required. Or, depending on why you want the text, you can adjust expectations or add layers of software, such as fuzzy search algorithms, to deal with the issues.

Again, depending on the application, the mixed quality of OCR isn't always a deal breaker, but it's not always as simple as it might appear.
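
To make the fuzzy search idea above a bit more concrete, here is a minimal sketch using Python's standard difflib. The function name fuzzy_find, the 0.8 threshold, and the sample OCR text are all made up for illustration; a real pipeline would probably want a proper index or an edit-distance library rather than a linear scan.

```python
from difflib import SequenceMatcher


def fuzzy_find(query, text, threshold=0.8):
    """Return (word, score) pairs for words in text that roughly match query.

    Hypothetical helper: compares the query against every whitespace-separated
    word, so it only suits small documents.
    """
    hits = []
    for word in text.split():
        # Drop surrounding punctuation so "invoice." still compares cleanly.
        candidate = word.strip(".,;:!?()\"'").lower()
        score = SequenceMatcher(None, query.lower(), candidate).ratio()
        if score >= threshold:
            hits.append((candidate, round(score, 2)))
    return hits


# Invented sample with typical OCR confusions (i/l, i/1).
ocr_text = "The lnvoice total was paid. See invo1ce 42 and the involce copy."
matches = fuzzy_find("invoice", ocr_text)
print(matches)
```

An exact search for "invoice" would find nothing in that sample, while the fuzzy version picks up all three misread spellings. The threshold is a trade-off: lower it and you catch worse OCR errors but also more false positives.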

lumpypua · on Aug 11, 2014

It's not the text that's the issue, it's the structure. PDFs have nowhere near as much structure as markup. You end up having to do this for dozens of layouts and it gets hurty really fast:

http://schoolofdata.org/2013/06/18/get-started-with-scraping...
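
As a rough illustration of the structure problem: text pulled out of a PDF usually arrives as positioned fragments rather than rows and cells, so the table has to be rebuilt from coordinates. The sketch below assumes you already have (x, y, text) tuples from some extractor; the function name words_to_rows, the tolerance value, and the sample data are hypothetical, and real layouts need per-document tuning.

```python
def words_to_rows(words, y_tolerance=2.0):
    """Group (x, y, text) fragments into rows, each sorted left to right.

    Assumes y grows down the page and that fragments on the same visual
    line have y values within y_tolerance of the first fragment in the row.
    """
    rows = []  # each entry: (row_y, [(x, text), ...])
    for x, y, text in sorted(words, key=lambda w: w[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Within each row, order cells by x position.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Invented fragments, deliberately out of order and slightly misaligned.
fragments = [
    (200.0, 100.5, "Price"),
    (50.0, 100.0, "Item"),
    (50.0, 120.0, "Widget"),
    (200.0, 119.2, "9.99"),
]
table = words_to_rows(fragments)
print(table)
```

This works for one tidy layout. Multi-line cells, merged columns, or a slightly different template each need their own rules, which is where the per-layout effort piles up.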

dennybritz · on Aug 11, 2014

There are computer vision libraries that automatically extract tables from PDFs. For example, pdf2table: http://ieg.ifs.tuwien.ac.at/projects/pdf2table/

You may want to give that a try if you haven't looked at it before.