
Parsing PDFs in Python with Tika - blondie9x
https://cbrownley.wordpress.com/2016/06/26/parsing-pdfs-in-python-with-tika/
======
maxxxxx
PDF is easily the worst document format I have ever worked with. I worked on a
project that fed the contents of PDF files into Lucene to create a full text
index. It was a sheer nightmare to extract text from those. Tika worked but a
lot of documents still converted to garbled text. I ended up using 4 different
PDF to text converters and scored their results by the percentage of known
words they returned. That was a few years ago. I'd be curious if things have
improved by now.
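
For anyone wanting to try the same scoring idea, here is a minimal sketch of it, not the code from that project. The function names, the tiny vocabulary, and the sample converter outputs are all made up for illustration; a real run would load a proper word list and feed in the actual text from each converter.

```python
import re

def known_word_ratio(text, vocabulary):
    """Return the fraction of alphabetic tokens in `text` that are in `vocabulary`."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0  # empty or fully garbled output scores zero
    known = sum(1 for word in words if word in vocabulary)
    return known / len(words)

def pick_best(outputs, vocabulary):
    """Given {converter_name: extracted_text}, return the name with the highest ratio."""
    return max(outputs, key=lambda name: known_word_ratio(outputs[name], vocabulary))

# Hypothetical data: a toy vocabulary and two invented converter outputs.
vocabulary = {"the", "quick", "brown", "fox"}
outputs = {
    "converter_a": "the quick brown fox",
    "converter_b": "t h e qu1ck br0wn f0x",
}
best = pick_best(outputs, vocabulary)
```

Garbled output tends to break into fragments that are not dictionary words, so its ratio drops and the cleaner conversion wins.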

~~~
dunham
We've had good luck with "pdftotext" from poppler/xpdf. The only things that
were garbled were intentionally obfuscated. (The files had a mangled character
encoding table that was added to the pdf to prevent copy/paste.)

You're also gonna have problems with all-image pdfs (scans) that have low
quality OCR results layered on top of them.

If you want to decode tables, "pdftotext -layout" is a good starting point. If
you need more detail, try "pdftohtml -xml -fontfullname".
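
If you are calling it from Python, a rough sketch of wrapping "pdftotext -layout" could look like this. The helper names are my own, and it assumes the pdftotext binary is on your PATH; passing "-" as the output file sends the text to stdout.

```python
import shutil
import subprocess

def pdftotext_command(pdf_path, layout=True):
    """Build the pdftotext argument list; "-" as the output file means stdout."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # keep the physical layout, useful for tables
    cmd += [pdf_path, "-"]
    return cmd

def extract_text(pdf_path, layout=True):
    """Run pdftotext on `pdf_path` and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path, layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text("report.pdf") with a hypothetical file name would return the layout-preserved text, or raise if the binary is missing or the conversion fails.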

~~~
maxxxxx
Now that you mention it, I remember pdftotext usually produced the best results
and was also several times faster than Tika.

------
daveguy
Here's the direct link to the library (Apache License v2):

[https://github.com/chrismattmann/tika-python](https://github.com/chrismattmann/tika-python)

Good, simple getting started documentation.
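
Basic usage looks roughly like the sketch below. The wrapper function and its `parse` argument are my own additions so the logic can be tried without a Tika server; the library call itself is `parser.from_file`, which returns a dict with "metadata" and "content" keys. It assumes `pip install tika` and a working Java runtime.

```python
def extract_with_tika(pdf_path, parse=None):
    """Return the text content Tika extracts from `pdf_path`.

    `parse` is a hypothetical hook for swapping in a stand-in parser;
    by default the real tika-python parser is used.
    """
    if parse is None:
        # Imported lazily so this module loads even without tika installed.
        from tika import parser
        parse = parser.from_file
    parsed = parse(pdf_path)
    # "content" can be None, e.g. for image-only PDFs with no text layer.
    return (parsed.get("content") or "").strip()
```

With the default parser, extract_with_tika("example.pdf") (a hypothetical file name) would start or connect to a local Tika server and return the document text.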

------
based2
[http://tika.apache.org/](http://tika.apache.org/)

[https://pdfbox.apache.org/](https://pdfbox.apache.org/)

[http://seclists.org/bugtraq/2016/May/119](http://seclists.org/bugtraq/2016/May/119)
[CVE-2016-2175] Apache PDFBox XML External Entity vulnerability

