
PDF Miner - newsit
http://pypi.python.org/pypi/pdfminer/20090330
======
brandnewlow
If anyone wants to use this for something public-service oriented:

Chicago is running for the 2016 Olympic games. About a month ago they released
their official "bid book" in PDF form. The local papers gave it a look and
wrote some fine stories, but a bunch of local journalists (myself among them)
would like to extract the thing out into a Wiki so people could discuss and
annotate it instead of just reading it in PDF form.

Link to the bid book:
<http://www.chicago2016.org/our-plan/bid-book/bid-book.aspx>

We were thinking of using MediaWiki as the wiki engine. One of us is currently
running (the excellent) Chicago Elections Wiki over at
<http://chicagoelections.pbwiki.com/>

We'd host, promote, annotate and fill out the wiki. The important thing is to
move this from a PDF to an interactive, scannable, hypertext format so people
can tear it apart.

We'd been talking about sneaking into PyCon and asking around if anyone there
would be interested in working on this. It looks like this PDF miner is the
start of something that could do this.

~~~
dmv
For a one-off use, writing code (and finding a coder) might be overkill. I
have been impressed by the results produced by <http://www.pdftoword.com>,
which can render even extremely weird PDF formatting to Word or RTF. You
should be able to convert safely from there.

~~~
brandnewlow
Thanks for the link. I agree with you there. I was thinking maybe a general
PDF-to-MediaWiki/PBWiki script might be of use for this and other projects.
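
A minimal sketch of what the MediaWiki half of such a script might look like,
assuming the text has already been extracted and that pages are separated by
form feed characters (which is what pdftotext emits by default). The function
name and the "title, page N" heading scheme are made up for illustration, and
nothing here escapes characters that MediaWiki treats as markup.

```python
def pages_to_wikitext(text, title="Bid book"):
    """Turn extracted text into MediaWiki markup, one section per page."""
    # Assumption: the extractor separates pages with form feeds ("\f").
    pages = text.split("\f")
    sections = []
    for number, page in enumerate(pages, start=1):
        body = page.strip()
        if not body:
            continue  # skip empty pages, e.g. after a trailing form feed
        # "== ... ==" is MediaWiki's level-2 heading syntax.
        sections.append("== %s, page %d ==\n\n%s\n" % (title, number, body))
    return "\n".join(sections)
```

Keeping one section per page means annotations on the wiki can still be traced
back to a page in the original PDF.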

There's an open government hackathon this week at PyCon here in Chicago and it
looks like at least three of the proposed ideas are PDF-to-Text apps.

<http://feedback.sunlightfoundation.com/hackathon/>

------
bd
I used it recently for analysis of PDF articles.

It's quite good, though as it is written in pure Python, it's rather slow
(especially compared to command line tools written in C/C++).

I strongly recommend using Psyco [1]. Adding a few lines of code cut my
PDF->HTML conversion times by half.
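
The exact lines used are not given in the comment. A common way to switch
Psyco on looks roughly like the sketch below; the enable_psyco wrapper is a
hypothetical helper, and Psyco itself only works on older CPython releases, so
the import is guarded.

```python
def enable_psyco():
    """Try to turn on Psyco; return True if it was enabled."""
    try:
        import psyco  # optional dependency, unavailable on modern Pythons
    except ImportError:
        return False  # carry on at normal interpreter speed
    psyco.full()  # ask Psyco to compile every function it can
    return True


# Call once at startup, before the conversion work begins.
enable_psyco()
```

Guarding the import keeps the script usable on machines without Psyco; it
just runs at ordinary speed there.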

Also, be warned that the markup it produces can be very heavy. Depending on
how the PDF is structured, you can end up with a huge number of DOM elements.

\-----

[1] <http://psyco.sourceforge.net/>

------
latortuga
For our startup we had a huge integration project with an industry-specific
PDF, so I ended up writing a PDF importer that sounds like it does something
similar to this project. The best part: I couldn't figure out how to get my
reader to determine which page a specific set of coordinates was on, and it
looks like this library supports that. Thanks for the link!

------
jpcx01
Looks interesting. Any good ruby alternatives?

~~~
draegtun
Not sure. However, there is a well-established Perl one...
<http://search.cpan.org/dist/CAM-PDF/>

------
albertsun
Nice stuff. So many public documents are released in PDF format instead of an
easy-to-work-with plain-text format.

------
mahmud
Does anyone know if something like this exists for C? It would be nice to be
able to call it from $LANGUAGE.

~~~
bd
You could use the innards of some open source PDF viewer (most of them are
written in C++).

I managed to get away with just using the command-line PDF->TXT tool that came
with Xpdf [1].
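
Shelling out to that tool (pdftotext) works from most languages. Here is a
rough Python sketch, assuming pdftotext is installed and on the PATH; the two
function names are hypothetical. Passing "-" as the output file sends the text
to stdout, and -layout and -enc are pdftotext options.

```python
import subprocess


def pdftotext_command(pdf_path, layout=False):
    """Build the argument list for a pdftotext run that writes to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]  # request UTF-8 output
    if layout:
        cmd.append("-layout")  # try to keep the physical page layout
    cmd += [pdf_path, "-"]  # "-" means: write the text to stdout
    return cmd


def extract_text(pdf_path, layout=False):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        pdftotext_command(pdf_path, layout),
        check=True,  # raise CalledProcessError if pdftotext fails
        stdout=subprocess.PIPE,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Usage would be something like extract_text("bidbook.pdf"), where the file name
is only a placeholder.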

Also MuPDF is awesome [2]. Its bare bones demo PDF viewer replaced Foxit as my
default PDF handler.

If you have some more serious budget, there is also PDFlib TET [3], a nice
commercial solution with bindings for many languages (C, C++, Python, Ruby,
Perl, .NET, and many others).

\-----

[1] <http://www.foolabs.com/xpdf/download.html>

[2] <http://ccxvii.net/fitz>

[3] <http://www.pdflib.com/en/products/tet/>

~~~
visitor4rmindia
VimTip: You can integrate xpdf with vim to view PDFs directly in your editor.

    " Open PDFs read-only, then replace the buffer with pdftotext's output.
    autocmd BufReadPre *.pdf set ro
    autocmd BufReadPost *.pdf silent %!pdftotext "%" -

~~~
latortuga
Unencrypted PDF files can be opened in any editor; they are plaintext.

------
sketerpot
AAAAAAGHFGUREH!!! I had to write my own a few months ago, which sucked. If I
had known about this, I could have been saved a lot of effort. Noooooo!

Technology moves forward, I see.

