
Apache Tika: Extract and index content and metadata from your files - memexy
https://tika.apache.org/
======
memexy
> The Apache Tika™ toolkit detects and extracts metadata and text from over a
> thousand different file types (such as PPT, XLS, and PDF). All of these file
> types can be parsed through a single interface, making Tika useful for
> search engine indexing, content analysis, translation, and much more.

I'm currently using the HTTP server interface to extract and index metadata
and content from screenshots, HTML files, and PDFs. Sending a file to the /meta
endpoint returns its metadata; sending the same file to the /tika endpoint runs
OCR and text extraction. I then store the output in a content-addressable way
(keyed by SHA256 hashes), both in files and in CouchDB for easy searching.
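
For anyone curious what that pipeline might look like, here is a minimal Python sketch rather than my actual code. It assumes a Tika server on its default port (9998) that accepts the raw file bytes via HTTP PUT, and a CouchDB database URL passed in by the caller. The helper names (sha256_hex, blob_path, build_doc, index_file), the on-disk layout, and the choice to key everything by the hash of the original file are all hypothetical.

```python
import hashlib
import json
import urllib.request
from pathlib import Path

# Assumption: tika-server is running locally on its default port.
TIKA_URL = "http://localhost:9998"


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def blob_path(root: Path, digest: str, suffix: str) -> Path:
    """Map a digest to a file path, sharded by its first two hex characters."""
    return root / digest[:2] / (digest + suffix)


def tika_put(endpoint: str, data: bytes, accept: str) -> bytes:
    """PUT raw file bytes to a Tika server endpoint and return the response body."""
    req = urllib.request.Request(
        TIKA_URL + endpoint,
        data=data,
        method="PUT",
        headers={"Accept": accept},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()


def build_doc(digest: str, metadata: dict, text: str) -> dict:
    """Build a CouchDB document whose _id is the hash of the original file."""
    return {"_id": digest, "metadata": metadata, "text": text}


def couch_put(db_url: str, doc: dict) -> dict:
    """PUT the document into CouchDB (authentication omitted in this sketch).

    Re-sending an existing _id without its _rev is rejected with a conflict,
    which urllib surfaces as an HTTPError.
    """
    req = urllib.request.Request(
        db_url + "/" + doc["_id"],
        data=json.dumps(doc).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


def index_file(root: Path, source: Path, db_url: str) -> str:
    """Extract metadata and text for one file, then store both on disk and in CouchDB."""
    data = source.read_bytes()
    digest = sha256_hex(data)
    meta_raw = tika_put("/meta", data, "application/json")
    text_raw = tika_put("/tika", data, "text/plain")
    # Everything is keyed by the hash of the original file, so identical
    # files always land at the same paths.
    outputs = ((".bin", data), (".meta.json", meta_raw), (".txt", text_raw))
    for suffix, payload in outputs:
        path = blob_path(root, digest, suffix)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(payload)
    couch_put(db_url, build_doc(digest, json.loads(meta_raw), text_raw.decode("utf-8")))
    return digest
```

Calling something like index_file(Path("store"), Path("screenshot.png"), "http://localhost:5984/docs") would then process one file; the database name "docs" is made up. Keying by the file's own hash means re-indexing the same file doesn't create duplicates on disk.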

In theory, Solr can also be used for the same purpose, but I haven't tried that
yet.

