
PDFMiner in Python - J3L2404
http://www.unixuser.org/~euske/python/pdfminer/index.html
======
mrleinad
I worked for about 3 years for a Spanish law website whose main business goal
is to provide a centralized access point to legal documents for lawyers, and
converting PDFs to text was a regular task. We used Perl and lots of
PDF -> text/HTML tools. I can say it's one of the most horrible tasks to
perform, and I'm glad there are more options for working with those documents,
but PDFs as a method of content distribution should be shot and buried in the
desert. They're annoying, to say the least, when you don't have the necessary
plugins/reader installed, and of course they can't be easily converted to
other document formats without destroying something in the process. I'm
happy I don't have to work on that anymore. One good thing, though, is
that I have now mastered regular expressions.

Anyway, if anyone has given PDFMiner a shot, let me know how good or bad it is.

~~~
tren
I tried converting a relatively simple PDF document into HTML and the results
were average. There were overlapping fonts, missing images, etc.

Working with PDFs is an extremely frustrating experience. For years I've dealt
with the poorly documented Adobe SDK and many third-party tools while
working in the ebook industry. In my opinion, converting PDF to HTML is close
to an impossible task that will never yield consistent results and will always
require manual intervention.

However, a few years ago we developed an online reader that renders PDF
files converted to images with a text overlay, which is far more
reliable than a conversion to HTML. You can see an example book for free here:
<http://amigoreader.com/moonstone/>. We hope to open this up for users to
share and discuss their PDF documents in the near future.

~~~
itmag
Dude, I am very interested in this.

When are you launching?

~~~
tren
We're aiming for early 2012. We're launching our WP7 and Android readers
first.

------
trentonstrong
I had the pleasure of using the out-of-the-box pdf2txt tool just yesterday.
It worked pretty well for extracting some governmental data released (i.e.
buried) inside of a PDF!
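
For anyone who wants to script that, here is a minimal sketch of calling
the tool from Python. It assumes PDFMiner's `pdf2txt.py` script is installed
and on your PATH; the helper names and the file names are made up for
illustration, and I have only used the `-o` (output file), `-t` (output type)
and `-P` (password) options here.

```python
import shutil
import subprocess


def build_pdf2txt_command(pdf_path, out_path, output_type="text", password=None):
    """Build the argv list for PDFMiner's pdf2txt.py (hypothetical helper)."""
    cmd = ["pdf2txt.py", "-o", out_path, "-t", output_type]
    if password is not None:
        # -P passes the document password to pdf2txt.py
        cmd += ["-P", password]
    cmd.append(pdf_path)
    return cmd


def run_pdf2txt(pdf_path, out_path):
    """Run pdf2txt.py if it can be found; return False when it is missing."""
    if shutil.which("pdf2txt.py") is None:
        return False  # assumption: the script is normally installed on PATH
    subprocess.run(build_pdf2txt_command(pdf_path, out_path), check=True)
    return True
```

Something like `run_pdf2txt("report.pdf", "report.txt")` would then write the
extracted text next to the source file, provided the script is available.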

------
doktrin
Sick! The lack of adequate functionality in PyPDF has been a pet peeve of mine
for a while! I really look forward to trying this out.

------
gahahaha
I am disappointed that it can't extract the text from a password-protected PDF,
since I have a few such documents. Not knowing much about the PDF format, I
would assume the text would be easy to extract since it is easy to show the
text on the screen. What is the best way to print and copy/paste from such
documents?

~~~
narcissus
I can't remember the exact process I went through for my password-protected
PDFs, but the majority of them could be handled by converting them to a PS file
and then back to PDF. I think I used GhostScript?
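
A rough sketch of that round trip, assuming the `pdf2ps` and `ps2pdf` wrapper
scripts that ship with Ghostscript are on your PATH. The function and file
names are hypothetical, and I can't promise this is the exact process I used
or that it works on every file.

```python
import subprocess
from pathlib import Path


def roundtrip_commands(src_pdf, dst_pdf):
    """Return the two commands for a PDF -> PS -> PDF round trip."""
    # The intermediate PostScript file sits next to the destination PDF.
    ps_path = str(Path(dst_pdf).with_suffix(".ps"))
    return [
        ["pdf2ps", src_pdf, ps_path],
        ["ps2pdf", ps_path, dst_pdf],
    ]


def roundtrip(src_pdf, dst_pdf):
    """Run both conversions in order, stopping at the first failure."""
    for cmd in roundtrip_commands(src_pdf, dst_pdf):
        subprocess.run(cmd, check=True)
```

Calling `roundtrip("locked.pdf", "copy.pdf")` leaves a `copy.ps` file behind
that can be deleted afterwards.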

