
As is customary for all of Apache, I have no clue what I’m looking at after trying to read through the links on that page for ten minutes. Like, who is this tool for? When should I use this vs. any of the competing tools? No clue. I suppose it can read documents of any type and give them back as a dictionary? Why would I use this vs. pandas?



I can't speak to the Apache documentation, but I once had the task of extracting plain text from many different document formats: Word, spreadsheets, PDFs, the EXIF information in JPEGs, and so on for a long list. I had written code with calls to extractor libraries for several of these formats when I came across Tika. Out went my if..then..elif..elif..elif.. code, to be replaced with a single (Python) call to Tika.

I can't answer your question about pandas, though.
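
For anyone wondering what that single Python call might look like, here is a minimal sketch, assuming the tika-python package (`pip install tika`) and its `parser.from_file` function, which returns a dictionary with `metadata` and `content` keys. The `extract_text` wrapper and its `parse` parameter are made up for illustration, so the function can be tried without a Tika server running; as far as I know, the real call needs Java and by default starts a local Tika server on first use.

```python
from typing import Callable, Optional


def extract_text(path: str, parse: Optional[Callable[[str], dict]] = None) -> str:
    """Return the plain text of a document, whatever its format.

    `parse` is a hypothetical hook for illustration: any callable that takes
    a path and returns a dict shaped like tika-python's result.
    """
    if parse is None:
        # Imported lazily so the sketch can be loaded without tika installed.
        # Assumption: tika-python is installed and Java is available.
        from tika import parser
        parse = parser.from_file

    parsed = parse(path)
    # "content" can be None when no text could be extracted (e.g. some images).
    return (parsed.get("content") or "").strip()
```

With Tika available, `extract_text("report.pdf")` and `extract_text("sheet.xlsx")` go through the same call (the file names are placeholders); the format detection that used to live in the if/elif chain happens inside Tika.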


I second this: there is absolutely no easily discoverable entry point to the documentation. In the end, if you want to get a feeling for what this is, you search for "tika tutorial" and get a rough idea via (in my case) some Medium article, I guess.


There's a book called "Tika in Action" which I found useful.



