Thankfully nowadays the problem I run into the most is memory exhaustion when some client uploads a 500+ MB PDF and expects my cheap Solr SaaS (https://hostedapachesolr.com) to handle extraction for these giant files!
Tried a bunch of libraries and settled on Tika. Although we were a PHP/Node shop, nothing compared to the ease of using Tika for this exact purpose.
If you're trying to get something fit for consumption by a human (including via a screen reader) or an NLP pipeline, though, all the problems discussed in that FilingDB article still apply.
It's a really easy-to-use generic parser for a bunch of document types. The downside of being so generic and easy to use is that you end up lacking document-specific context that could be useful. For example: Do you consider the header/footer text to be important, or just noise (Page 1, Page 2, etc.)? Is the text contained in the Table of Contents or section headers important, or just the actual content? You won't find any way to tweak the result, which could be a good or bad thing depending on your use case.
We ended up using it as our "fallback" parser, writing more contextually aware ones for document types of greater importance to our use case (PDFs were high on the list).
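For anyone curious what that "fallback" setup could look like, here is a minimal sketch in Python. It is not our actual code: `generic_parse`, `parse_pdf`, `SPECIALIZED` and `extract` are all made-up names, the generic parser is only a stand-in for a Tika-style extractor (no real Tika call is made), and the "PDF" parser is a toy that works on already-decoded text and just drops "Page N" lines.

```python
import re
from typing import Callable, Dict

# Hypothetical type: a parser takes raw bytes and returns extracted text.
Parser = Callable[[bytes], str]


def generic_parse(data: bytes) -> str:
    """Stand-in for a generic extractor such as Tika (no real Tika call)."""
    return data.decode("utf-8", errors="replace")


def parse_pdf(data: bytes) -> str:
    """Toy context-aware parser: treats 'Page N' lines as noise and drops them."""
    text = data.decode("utf-8", errors="replace")
    kept = [
        line
        for line in text.splitlines()
        if not re.fullmatch(r"Page \d+", line.strip())
    ]
    return "\n".join(kept)


# Document types that matter most get their own parser (assumed mapping).
SPECIALIZED: Dict[str, Parser] = {"application/pdf": parse_pdf}


def extract(data: bytes, mime_type: str) -> str:
    """Use a specialized parser when one exists, else fall back to the generic one."""
    parser = SPECIALIZED.get(mime_type, generic_parse)
    return parser(data)
```

The point of the lookup table is that adding a contextually aware parser for a new document type is one extra entry, and everything else keeps going through the generic path.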
NiFi, Flink, Mesos, Kafka, Cocoon, Cordova, Hadoop.
But hosting the output was trickier. The PDF export ended up at 140MB. The PNG ended up at 10MB but poor quality. SVG is perfect! But I only found one service to host it, and it seems even this SVG (50MB) is too large for them to handle!
So, anywhere I can host big SVGs? Then I could publish the quick hack I made to get the title, programming language + description from the projects list and into something you can quickly scan over.
Edit: Aha, found one host that was OK with the file size of a bigger PNG (image is 16806x9984, keep in mind) https://de.catbox.moe/espesr.png Also keep in mind, I'm not a designer, so it is what it is.
Will try to upload the SVG that is a bit more friendly on the data and performance. Edit2: SVG version https://de.catbox.moe/qppy6t.svg
But I agree, it would have been nice to have a version of this page with first para of the description inlined in the listing itself.
as the data source for this page
They seem roughly equivalent, neither better than the other. In particular, both fail the same dehyphenations, a category of error that's extremely frustrating for text-to-speech users. By default Tika seems more aggressive when joining split lines, but without good dehyphenation that's not worth much, and some of the lines it joins shouldn't be joined.
Calibre's is also several times faster, but the difference between 1 second and 7 seconds isn't really a big deal for my purposes.
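To make the dehyphenation problem concrete, here is a rough Python sketch of the kind of post-processing step I mean. It is not what Tika or Calibre do internally; `dehyphenate` and the `vocabulary` set are hypothetical, and a real version would need a proper word list and handling for cases this ignores (capitalized words, page breaks, soft hyphens).

```python
import re

# Matches a word fragment, a hyphen at end of line, and the next line's first word.
_SPLIT_WORD = re.compile(r"(\w+)-\n(\w+)")


def dehyphenate(text: str, vocabulary: set) -> str:
    """Rejoin words that were hyphenated across a line break.

    If the two halves form a known word, the hyphen is dropped
    ("extrac-" + "tion" -> "extraction"). Otherwise the hyphen is assumed
    to be real and is kept ("well-" + "known" -> "well-known").
    In both cases the line break between the halves is removed.
    """

    def join(match):
        left, right = match.group(1), match.group(2)
        if (left + right).lower() in vocabulary:
            return left + right
        return left + "-" + right

    return _SPLIT_WORD.sub(join, text)
```

Even this naive version shows why it's hard: without a vocabulary lookup you can't tell a line-break hyphen from a real one, and a wrong guess either way is exactly what trips up a text-to-speech engine.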
Apache POI is amazingly complex and undocumented. It is one of the few Java libraries I have ever seen where classes have setters but no getters, and the hacks you find on Stack Overflow involve reflective traversals and coercion of access modifiers.