

Apache Tika - a content analysis toolkit - zerop
http://tika.apache.org/

======
mtoddh
+1 for Tika. I've been using Tika in conjunction with SOLR for a job search
engine I built (www.neekanee.com) and it's really useful for extracting text
from PDFs and .doc files if you're not overly concerned with the structure of
the document.

------
typicalrunt
_The Apache Tika™ toolkit detects and extracts metadata and structured text
content from various documents using existing parser libraries. You can find
the latest release on the download page. See the Getting Started guide for
instructions on how to start using Tika._

I read that and I still don't get it, but I can't find any other information
on the site. Can someone provide a specific example of what problem this
project solves?

~~~
PabloOsinaga
At the core, it is a library that extracts text from various document formats
(PDF, xlsx, docx, pptx, etc.).

It is important to note, however, that Tika itself doesn't implement the
text extractors; rather, it combines different projects (e.g., Apache POI)
behind one simple interface, "getText".

If you are performance-oriented, though, you may find yourself using the
individual parsers directly, since you'd want different settings/options to
maximize throughput.
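
To make the "one simple interface over many parsers" idea concrete, here is a
minimal sketch of the pattern in Python. It is not Tika's API: `extract_text`,
`PARSERS`, `parse_plain`, and `parse_html` are hypothetical names, and the two
standard-library parsers only stand in for the real per-format libraries that
Tika wraps.

```python
import mimetypes
from html.parser import HTMLParser
from pathlib import Path


class _HTMLText(HTMLParser):
    """Collects the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def parse_plain(path):
    # Stand-in for a format-specific parser: plain text needs no work.
    return Path(path).read_text(encoding="utf-8")


def parse_html(path):
    # Stand-in for a second format-specific parser.
    collector = _HTMLText()
    collector.feed(Path(path).read_text(encoding="utf-8"))
    return " ".join(collector.chunks)


# Hypothetical registry mapping a detected media type to its parser.
PARSERS = {
    "text/plain": parse_plain,
    "text/html": parse_html,
}


def extract_text(path):
    """Single entry point: guess the type from the name, then delegate."""
    mime, _ = mimetypes.guess_type(str(path))
    parser = PARSERS.get(mime)
    if parser is None:
        raise ValueError(f"no parser registered for {mime!r}")
    return parser(path)
```

A caller only ever sees `extract_text(path)`. The performance trade-off
mentioned above shows up here too: anyone who needs parser-specific options
would bypass the registry and call the individual parser directly.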

------
bergie
We used Tika for extracting data from various complex Office documents and
converting them to Linked Data in the Proggis project:
[http://bergie.iki.fi/blog/business_analytics_with_couchdb_an...](http://bergie.iki.fi/blog/business_analytics_with_couchdb_and_noflo/)

In addition to content extraction, Tika seems able to do some interesting
things like figuring out the language of the document contents.

------
vineet
On a related note, we are trying out a site which should help people use
projects like Tika. I am curious what you guys think of the site:
<http://www.codemaps.org/c/Apache_Tika>

------
PaulHoule
I'd like to see a comparison with Apache UIMA.

