
Apache Tika – a content analysis toolkit - loa_in_
https://tika.apache.org/
======
geerlingguy
Tika has been around for ages, and I remember many of the early versions
(probably up to 1.2 or 1.3) would completely explode if you threw in a PDF
with some UTF-8 characters, or Word documents with many foreign/non-ASCII
languages.

Thankfully nowadays the problem I run into the most is memory exhaustion when
some client uploads a 500+ MB PDF and expects my cheap Solr SaaS
([https://hostedapachesolr.com](https://hostedapachesolr.com)) to handle
extraction for these giant files!

~~~
arafalov
Yeah, we are seeing the same problem in Solr User Group discussions. That Tika
integration will need to change to something else (with standalone Tika). Of
course, Drupal's Solr module already lets you point it at a separate Tika, but
I guess there is no equivalent Tika SaaS, so they just point it all at your
service.
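
For anyone wondering what "standalone Tika" with a guard against giant uploads might look like, here is a rough sketch, not how Solr or the Drupal module actually does it. It assumes a Tika server running on localhost:9998 with its `/tika` endpoint accepting a PUT; the size limit and the function names are made up for illustration.

```python
import urllib.request

# Assumptions: a standalone Tika server on localhost:9998, and an arbitrary
# size limit picked only for illustration.
TIKA_URL = "http://localhost:9998/tika"
MAX_BYTES = 100 * 1024 * 1024


def build_extract_request(
    data: bytes, url: str = TIKA_URL, max_bytes: int = MAX_BYTES
) -> urllib.request.Request:
    """Build (but do not send) a PUT request asking Tika for plain text."""
    # Reject oversized documents before they ever reach the extractor.
    if len(data) > max_bytes:
        raise ValueError(f"refusing to extract {len(data)} bytes (limit {max_bytes})")
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(data: bytes, timeout: float = 60.0) -> str:
    """Send the document to the Tika server and return the extracted text."""
    request = build_extract_request(data)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Running extraction in a separate process like this means a huge PDF can only take down the Tika server, not the search index next to it.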

------
zkanda
We had a problem where we needed to index and make searchable hundreds of
thousands of government PDF files, some as old as 15 years.

Tried a bunch of libraries and settled on Tika. Although we were a PHP/Node
shop, nothing could compare to the ease of using Tika for this exact
purpose.

------
GEBBL
A timely link, given there is a current discussion on the trouble with
extracting meaningful text from PDFs in another thread on the front page! I
look forward to reading the feedback on actual use of this.

~~~
mumblemumble
Tika's PDF text extraction is fine if you're just trying to get searchable
text. Which is what it's made for: slurping documents into Lucene. Full-text
search typically isn't terribly sensitive to getting the order of words right,
and is even less sensitive to getting the formatting right.

If you're trying to get something fit for consumption by a human (including via
a screen reader) or an NLP pipeline, though, all the problems discussed in
that FilingDB article still apply.
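
To make the word-order point concrete, here is a toy inverted index. It is only an illustration and not Lucene's actual data structure (Lucene can also store positions for phrase queries); `build_index` and `search` are hypothetical names.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each term to the set of document ids that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return ids of documents containing every query term, in any order."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    "clean": "Annual budget report for the water department",
    "scrambled": "department water the Annual for report budget",
}
index = build_index(docs)
hits = search(index, "water budget")
```

Both the cleanly extracted document and the scrambled one end up in `hits`, which is why badly ordered PDF text is usually good enough for keyword search and not for reading.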

~~~
tonitosou
aspose can convert pdf to html

~~~
chaps
Firefox's PDF reader does the same. A few years back, I wrote a pdf to
csv converter with selenium -- it worked surprisingly well! Though, after I
finished I found tabula and the code became immediately useless, hah.

~~~
rovr138
When that happens to me, it's due to a keyword I hadn't thought of looking
for. Then, while building the project, it came to me.

------
lazycrazyowl
Apache UIMA played a key role in the data intelligence and analytic
proficiency of the IBM Watson supercomputer, playing against human champions
on the TV show "Jeopardy!", and uses Tika for UIMA annotation.

[https://blogs.apache.org/foundation/entry/apache_innovation_...](https://blogs.apache.org/foundation/entry/apache_innovation_bolsters_ibm_s)

------
akerro
Is there any website, table or whatever that compiles all/most Apache projects
and describes them in one or two sentences?

NiFi, Flink, Mesos, Kafka, Cocoon, Cordova, Hadoop.

~~~
arafalov
It is a VERY long list:
[https://projects.apache.org/projects.html](https://projects.apache.org/projects.html)

~~~
akerro
I'm also aware of this one, doesn't have descriptions.

~~~
arafalov
The descriptions are behind the link, driven by the DOAP files each project
maintains.

But I agree, it would have been nice to have a version of this page with first
para of the description inlined in the listing itself.

------
liability
I just tried this out on a handful of PDFs, comparing it to Calibre's
`ebook-convert`.

They seem roughly equivalent, neither better than the other. Particularly,
both fail the same dehyphenations, a category of error that's extremely
frustrating for text-to-speech users. By default Tika seems more aggressive
when joining split lines, but without good dehyphenation that's not worth
much, and some of the lines it joins shouldn't be joined.

Calibre's is also several times faster, but the difference between 1 second
and 7 seconds isn't really a big deal for my purposes.
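
For reference, the kind of dehyphenation meant here can be sketched in a few lines. This is a naive illustration, not what Tika or Calibre does internally; the `vocabulary` argument is a hypothetical stand-in for a real word list.

```python
import re

# A word fragment, a hyphen at the end of a line, then the rest of the word.
_HYPHEN_BREAK = re.compile(r"(\w+)-\n(\w+)")


def dehyphenate(text: str, vocabulary: set[str]) -> str:
    """Rejoin words split across lines, keeping hyphens that look genuine."""

    def join(match: re.Match) -> str:
        left, right = match.group(1), match.group(2)
        joined = left + right
        # Only drop the hyphen when the joined form is a known word;
        # otherwise assume it was a real hyphen (e.g. "text-to-speech").
        if joined.lower() in vocabulary:
            return joined
        return f"{left}-{right}"

    return _HYPHEN_BREAK.sub(join, text)


words = {"error", "frustrating"}
fixed = dehyphenate("a category of er-\nror that's extremely frus-\ntrating", words)
kept = dehyphenate("bad for text-to-\nspeech users", words)
```

Here `fixed` has "error" and "frustrating" rejoined, while `kept` preserves "text-to-speech". The hard part in practice is the word list: without a good one you either leave broken words or glue together compounds that should stay hyphenated.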

------
merricksb
If curious, see (small) discussion on HN from 8 years ago:

[https://news.ycombinator.com/item?id=3878054](https://news.ycombinator.com/item?id=3878054)

------
mikhailfranco
Is Tika any good for MS Word structure?

Apache POI is amazingly complex and undocumented. It is one of the few Java
libraries I have ever seen where classes have setters but no getters, and the
hacks you find on Stack Overflow involve reflective traversals and coercion of
access modifiers.

------
nunorbatista
Last month I had to do a PDF parser and searched a lot for a solution like
Tika, but strangely this didn't come up. Cool, I'll test.

------
heffer
I only learned of Tika through Dovecot, which can use it to include email
attachments in its full-text search index. Pretty neat stuff.

