Apache Tika – a content analysis toolkit (apache.org)
153 points by loa_in_ 13 days ago | 27 comments

Tika has been around for ages, and I remember many of the early versions (probably up to 1.2 or 1.3) would completely explode if you threw in a PDF with some UTF-8 characters, or Word documents with many foreign/non-ASCII languages.

Thankfully nowadays the problem I run into the most is memory exhaustion when some client uploads a 500+ MB PDF and expects my cheap Solr SaaS (https://hostedapachesolr.com) to handle extraction for these giant files!

Yeah, we are seeing the same problem in Solr User Group discussions. That Tika integration will need to change to something else (with standalone Tika). Of course, Drupal's Solr module already lets you supply a separate Tika, but I guess there is no equivalent Tika SaaS, so they just point it all at your service.
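
For anyone who wants to try the standalone route, here is a minimal sketch of handing a file to a separately hosted Tika Server instead of extracting inside Solr. It assumes a tika-server instance on its default port 9998 and uses its `PUT /tika` endpoint with `Accept: text/plain`. The host, the size cap, and the function names are illustrative assumptions, not anything taken from the Solr or Drupal integrations.

```python
import urllib.request

# Assumed location of a standalone Tika Server (9998 is its default port).
TIKA_URL = "http://localhost:9998/tika"
# Hypothetical cap so a single huge upload is rejected before it is sent.
MAX_BYTES = 100 * 1024 * 1024


def build_tika_request(data: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request asking Tika Server for plain-text extraction."""
    if len(data) > MAX_BYTES:
        raise ValueError(f"refusing to send {len(data)} bytes (cap is {MAX_BYTES})")
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(data: bytes, url: str = TIKA_URL, timeout: float = 60.0) -> str:
    """Send the document to Tika Server and return the extracted text."""
    request = build_tika_request(data, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Assumes the server answers in UTF-8.
        return response.read().decode("utf-8")
```

Running Tika in its own process at least means a giant PDF takes down the extractor rather than the search server, and the cap gives you a place to say no early.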

We had a problem where we needed to index and make searchable hundreds of thousands of government PDF files, some of them as much as 15 years old.

Tried a bunch of libraries and settled on Tika. Although we were a PHP/Node shop, nothing compared to the ease of using Tika for this exact purpose.

A timely link, given there is a current discussion on the trouble with extracting meaningful text from PDFs in another thread on the front page! I look forward to reading the feedback on actual use of this.

Tika's PDF text extraction is fine if you're just trying to get searchable text. Which is what it's made for: slurping documents into Lucene. Full-text search typically isn't terribly sensitive to getting the order of words right, and is even less sensitive to getting the formatting right.

If you're trying to get something fit for consumption by a human (including via a screen reader) or an NLP pipeline, though, all the problems discussed in that FilingDB article still apply.
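
To make the full-text search point concrete, here is a toy inverted index. It is only an illustration, not how Lucene is actually implemented, and every name in it is made up. Each document is reduced to a set of terms, so an extraction with the words in the wrong order still matches the same keyword queries.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each term to the set of document ids that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return ids of documents containing every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    "clean": "Annual budget report for the water department",
    # The same words, as an extractor might emit them with columns interleaved.
    "scrambled": "report water Annual department budget the for",
}
index = build_index(docs)
hits = search(index, "water budget")  # both documents match
```

Phrase queries do depend on word order, which is part of why "typically" is doing some work above.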

Aspose can convert PDF to HTML.

Firefox's PDF reader does the same. A few years back, I wrote a PDF to CSV converter with Selenium -- it worked surprisingly well! Though after I finished, I found Tabula and the code became immediately useless, hah.

When that happens to me, it's due to a keyword I hadn't thought of looking for. Then, while building the project, it came to me.

I used Tika to build a search engine prototype, and it was fantastic for getting us up and running quickly.

It's a really easy to use generic parser for a bunch of document types. The downside of being so generic and easy to use is that you end up lacking document-specific context that could be useful. For example: Do you consider the header/footer text to be important, or just noise (Page 1, Page 2, etc.)? Is the text contained in the Table of Contents or section headers important, or just the actual content? You won't find any ways to tweak the result, which could be a good or bad thing depending on your use case.
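
One way to handle the header/footer question downstream is to post-process the extracted text. Here is a rough sketch, assuming you can get the text one page at a time (how you get per-page text is left open): lines that repeat on most pages, ignoring digits, are treated as noise. The function names and the 60% threshold are arbitrary assumptions.

```python
import re
from collections import Counter


def normalize(line: str) -> str:
    """Collapse digits so 'Page 1' and 'Page 2' count as the same line."""
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_lines(pages: list[str], min_share: float = 0.6) -> list[str]:
    """Drop lines whose normalized form appears on at least min_share of pages."""
    if len(pages) < 2:
        # With a single page nothing can repeat, so leave it alone.
        return list(pages)
    page_lines = [page.splitlines() for page in pages]
    counts: Counter[str] = Counter()
    for lines in page_lines:
        # Count each normalized line at most once per page.
        counts.update({normalize(line) for line in lines if line.strip()})
    threshold = min_share * len(pages)
    noise = {key for key, seen in counts.items() if seen >= threshold}
    return [
        "\n".join(line for line in lines if normalize(line) not in noise)
        for lines in page_lines
    ]


pages = [
    "ACME Report\nIntroduction text.\nPage 1",
    "ACME Report\nMethods text.\nPage 2",
    "ACME Report\nResults text.\nPage 3",
]
cleaned = strip_repeated_lines(pages)
```

It will also eat legitimately repeated lines, so it only makes sense where running headers are known to be noise.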

We ended up using it as our "fallback" parser, writing more contextually aware ones for document types of greater importance to our use case (PDFs were high on the list).
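
The fallback arrangement described here can be sketched as a small dispatcher: look up a specialised parser by MIME type and drop back to the generic one if there is none or if it fails. This is a guess at the shape, not the commenter's actual code. `parse_pdf` and `generic_parse` are hypothetical stand-ins, the latter for a call into Tika.

```python
from typing import Callable

Parser = Callable[[bytes], str]

# Registry of specialised parsers keyed by MIME type (all hypothetical).
SPECIALISED: dict[str, Parser] = {}


def register(mime_type: str) -> Callable[[Parser], Parser]:
    """Decorator that registers a parser for one MIME type."""
    def decorator(func: Parser) -> Parser:
        SPECIALISED[mime_type] = func
        return func
    return decorator


@register("application/pdf")
def parse_pdf(data: bytes) -> str:
    # Placeholder for a context-aware PDF parser.
    return "pdf-specific:" + data.decode("utf-8", errors="replace")


def generic_parse(data: bytes) -> str:
    # Stand-in for the generic extraction call (for example, Tika).
    return "generic:" + data.decode("utf-8", errors="replace")


def parse(data: bytes, mime_type: str) -> str:
    """Use the specialised parser when one exists, else the generic fallback."""
    parser = SPECIALISED.get(mime_type)
    if parser is not None:
        try:
            return parser(data)
        except Exception:
            # A failing specialised parser should not lose the document.
            pass
    return generic_parse(data)
```

For example, `parse(b"x", "application/pdf")` goes through `parse_pdf`, while an unregistered type such as `application/msword` lands in `generic_parse`.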

So how did you work out the "form" of the document, such as the table of contents and so on?

I’ve found Tika’s PDF to HTML parser to be pretty good. My only complaint is that in a double-spaced document, where there is an equal amount of space between paragraphs and normal lines, it labels every line as a separate paragraph.

Apache UIMA played a key role in the data intelligence and analytic proficiency of the IBM Watson supercomputer when it played against human champions on the TV show "Jeopardy!", and it uses Tika for UIMA annotation.


Is there any website, table or whatever that compiles all/most Apache projects and describes them in one or two sentences?

NiFi, Flink, Mesos, Kafka, Cocoon, Cordova, Hadoop.

Saw the other comments here, didn't find a good place either. Got inspired by the discussion here to write a quick Figma plugin to take JSON input and generate elements based on it. Using it together with the instances is amazing, very quick to prototype stuff.

But, hosting the output was more tricky. PDF export ended up 140MB. PNG ended up 10MB but poor quality. SVG is perfect! But only found one service to host it, and seems even this SVG is too large for them to handle! (50MB)

So, anywhere I can host big SVGs? Then I could publish the quick hack I made to get the title, programming language + description from the projects list and into something you can quickly scan over.

Edit: Aha, found one host that was OK with the file size of a bigger PNG (image is 16806x9984, keep in mind) https://de.catbox.moe/espesr.png Also keep in mind, I'm not a designer, so it is what it is.

Will try to upload the SVG that is a bit more friendly on the data and performance. Edit2: SVG version https://de.catbox.moe/qppy6t.svg

I'm also aware of this one, doesn't have descriptions.

The descriptions are behind the link, driven by the DOAP files each project maintains.

But I agree, it would have been nice to have a version of this page with first para of the description inlined in the listing itself.

Yeah, I know about this one, where 1/3 is undocumented and the sentences are either not for technical people or super generic.

Be the change you want to see in the world.

I just tried this out on a handful of PDFs, comparing it to Calibre's `ebook-convert`.

They seem roughly equivalent, neither better than the other. Particularly, both fail the same dehyphenations, a category of error that's extremely frustrating for text-to-speech users. By default tika seems more aggressive when joining split lines, but without good dehyphenation that's not worth much, and some of the lines it joins shouldn't be joined.

Calibre's is also several times faster, but the difference between 1 second and 7 seconds isn't really a big deal for my purposes.
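
For reference, the dehyphenation problem looks roughly like this: join a fragment that ends a line with a hyphen to the start of the next line only when the joined word is known, and otherwise keep the hyphen. This is a naive sketch and not what Tika or Calibre do internally. The tiny vocabulary is made up, and a real one would need a proper word list.

```python
import re

# Matches a word fragment, a hyphen, a line break, then the rest of the word.
SPLIT_WORD = re.compile(r"(\w+)-\n(\w+)")


def dehyphenate(text: str, vocabulary: set[str]) -> str:
    """Rejoin words split across lines; the line break at the split is dropped."""
    def join(match: re.Match) -> str:
        head, tail = match.group(1), match.group(2)
        joined = head + tail
        if joined.lower() in vocabulary:
            # A known word: the hyphen was only there for the line break.
            return joined
        # Unknown when joined: assume a genuine compound and keep the hyphen.
        return f"{head}-{tail}"
    return SPLIT_WORD.sub(join, text)


vocabulary = {"implementation"}
text = "An implemen-\ntation of a well-\nknown idea."
fixed = dehyphenate(text, vocabulary)
```

With a vocabulary this small almost everything falls into the "keep the hyphen" branch, which is exactly the kind of failure a text-to-speech user ends up hearing.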

If curious, see (small) discussion on HN from 8 years ago:


Is Tika any good for MS Word structure?

Apache POI is amazingly complex and undocumented. It is one of the few Java libraries I have ever seen where classes have setters but no getters, and the hacks you find on Stack Overflow involve reflective traversals and coercion of access modifiers.

Last month I had to do a PDF parser and searched a lot for a solution like Tika, but strangely this didn't come up. Cool, I'll test.

I only learned of Tika through Dovecot, which can use it to include email attachments in its full-text search index. Pretty neat stuff.
