

Ask HN: How to extract text from popular document formats - JanezStupar

I guess this is probably the best place on teh interwebz to get this kind of information.<p>I'm looking for a library, tool, or any other widget that would let me extract raw text from popular document formats (PDF, Word 97/2003/2007, OpenOffice, RTF - any others would be a bonus).<p>The tool does not need to be open source - commercial tools are also welcome, as long as they are not prohibitively expensive.<p>The use case is extracting text for full-text indexing via Apache Solr. I am aware that Solr can handle indexing whole documents - however, I would rather not have it juggle loads of raw documents. And I simply don't have enough time/motivation to roll my own parsers.<p>Update: Too lazy to google it for myself? Apache Tika: http://tika.apache.org/<p>Thanks a lot.
======
tsycho
antiword (<http://freshmeat.net/projects/antiword>) works well for Microsoft
Word documents.
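
For the Solr use case, a rough sketch of how this could be wired up: shell out
to a command-line extractor per file type and collect the plain text. This
assumes antiword is on the PATH for .doc files; using pdftotext (from
poppler/xpdf) for PDFs is my own addition, and the EXTRACTORS table and both
function names are made up for illustration.

```python
import subprocess
from pathlib import Path

# Hypothetical mapping from file extension to an extractor command line.
# antiword prints the document text to stdout; pdftotext does the same
# when the output file is given as "-".
EXTRACTORS = {
    ".doc": lambda p: ["antiword", p],
    ".pdf": lambda p: ["pdftotext", p, "-"],
}


def build_command(path):
    """Return the argv list for extracting text from `path`."""
    suffix = Path(path).suffix.lower()
    make_argv = EXTRACTORS.get(suffix)
    if make_argv is None:
        raise ValueError(f"no extractor configured for {suffix!r}")
    return make_argv(str(path))


def extract_text(path):
    """Run the extractor and return its stdout as text.

    Raises subprocess.CalledProcessError if the tool exits non-zero,
    and FileNotFoundError if the tool is not installed.
    """
    result = subprocess.run(build_command(path), capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

Something like extract_text("report.doc") would then give you a string to put
in a Solr field. Other formats (RTF, OpenOffice, Word 2007) would need their
own entries in the table.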

Are you trying to create an index/search utility for documents? If so, maybe
we can combine efforts. I was planning to build one and wrote a small amount
of code, but then got busy with other stuff and the project fell through the
cracks.

