
PDFMiner: Python PDF text parser - iamelgringo
http://www.unixuser.org/~euske/python/pdfminer/index.html
======
nickb
Does anyone know of a good way to convert MS OpenXML (docx, pptx, etc.) and
old Office (doc, ppt, etc.) files to txt from the Linux command line?

~~~
pmorici
xlhtml will work for .xls files. I'm not aware of one utility that will read
out the text reliably from all MS formats though. I've been pondering this
same problem as of late and have been thinking about a virtual machine
approach.

Have a conversion "service" running in a Windows VM that has the latest Office
installed. Use something like Python with the pywin32 module to extract text
via COM.

Advantage there is you don't have to worry about your thing breaking with new
releases of Office, you just upgrade to the latest version.

Disadvantages, obviously: the overhead of the VM, the cost of Windows and
Office licenses, and the need to roll back the VM snapshot from time to time
if you pick up any viruses.
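
For what it's worth, the COM idea above might look roughly like this. It is only a sketch: it assumes pywin32 and Word are installed inside the Windows VM, and the function name `extract_text_via_word` is made up for illustration. How the "service" receives files from the Linux side is left out.

```python
import os


def extract_text_via_word(path):
    """Return the plain text of a Word document by driving Word over COM.

    Hypothetical helper: only works on Windows with pywin32 and Word installed.
    """
    # Imported lazily so the module can still be loaded on non-Windows hosts.
    import win32com.client

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        # Positional arguments: FileName, ConfirmConversions, ReadOnly.
        doc = word.Documents.Open(os.path.abspath(path), False, True)
        try:
            return doc.Content.Text
        finally:
            doc.Close(False)  # False: do not save changes
    finally:
        word.Quit()
```

Word needs an absolute path, hence the `os.path.abspath` call. Excel and PowerPoint expose different object models, so each format would need its own variant.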

~~~
nickb
I'm hoping there's a way to automate it through OO.o... I'm pretty sure you
can run it headless.
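
Something along these lines might do it. This is a sketch that assumes a current LibreOffice build whose `soffice` binary is on the PATH and accepts `--headless` and `--convert-to`; older OO.o releases may not support these flags. The helper names are made up.

```python
import subprocess
from pathlib import Path


def build_convert_command(src, outdir, soffice="soffice"):
    """Build the argument list for a headless conversion to plain text."""
    return [
        soffice,
        "--headless",
        "--convert-to", "txt:Text",
        "--outdir", str(outdir),
        str(src),
    ]


def convert_to_text(src, outdir):
    """Run the conversion and return the path where the .txt should land."""
    subprocess.run(build_convert_command(src, outdir), check=True)
    return Path(outdir) / (Path(src).stem + ".txt")
```

Calling `convert_to_text("report.docx", "/tmp/out")` should leave `/tmp/out/report.txt` behind if the conversion succeeds.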

~~~
bootload
_"... I'm hoping there's a way to automate it through OO.o. ..."_

Try <http://search.cpan.org/dist/OpenOffice-OODoc/> and
<http://search.cpan.org/dist/OpenOffice-OODoc/OODoc/Intro.pod>. I've had some
success parsing Open Office docs and Word docs to text. Be prepared to parse
XML and read the spec.
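
To give an idea of what the XML parsing involves, here is a minimal Python sketch for the docx case from the original question. It is not the OODoc module. It assumes the body text lives in `word/document.xml` inside the zip, and it ignores tables, headers, footnotes and the like. The function name `docx_to_text` is made up.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by docx body markup.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(path):
    """Return paragraph text from a .docx file, one paragraph per line."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for para in root.iter("{%s}p" % W_NS):
        # A paragraph is split into runs; each run keeps its text in w:t.
        pieces = [t.text or "" for t in para.iter("{%s}t" % W_NS)]
        paragraphs.append("".join(pieces))
    return "\n".join(paragraphs)
```

The old binary .doc format is not a zip of XML, so this does nothing for those files.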

------
aswanson
Been looking for something like this for a while now. Thanks. Wonder if there
is a Ruby effort underway.

