
Ask HN: Is there a ready-to-go solution to parse document content? - fpd4444
Hey there,

I'm looking for a ready-to-go solution to parse documents (PDF, DOCX, PPTX and
others). By 'parse' I mean text extraction, including OCR if needed. I know
about Tika and have already tried it, but are there any more reliable
alternatives, maybe based on Tika? I'd like to interact with it via a REST API.

Thanks
======
programd
Tika has a REST server built in
[https://wiki.apache.org/tika/TikaJAXRS](https://wiki.apache.org/tika/TikaJAXRS)

Is there some functionality that you need and is not covered there? Some
specific document extraction feature?
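
For reference, a minimal sketch of calling that REST server from Python with only the standard library. It assumes a tika-server instance is already running locally on its default port 9998 (adjust `TIKA_URL` otherwise); the helper names `build_tika_request` and `extract_text` are made up for this example.

```python
import urllib.request

# Assumption: tika-server is running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data: bytes, base_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request that asks Tika to return plain text."""
    return urllib.request.Request(
        base_url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path: str, base_url: str = TIKA_URL) -> str:
    """Send the raw bytes of a document to Tika and return the extracted text."""
    with open(path, "rb") as fh:
        request = build_tika_request(fh.read(), base_url)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8", errors="replace")
```

Usage would be something like `print(extract_text("report.docx"))`, where `report.docx` is a placeholder file name.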

------
awinder
ElasticSearch has support for ingesting a bunch of document formats if you're
already using it / looking at using it in your stack:

[https://www.elastic.co/guide/en/elasticsearch/plugins/5.x/in...](https://www.elastic.co/guide/en/elasticsearch/plugins/5.x/ingest-attachment.html)

~~~
kognate
Elastic is using Tika.

~~~
orbz
Yep Tika is pretty much the choice in this situation. It's not perfect but
it's good enough for most purposes.

------
kognate
Yes, the Watson Document Conversion service meets those requirements:

[https://www.ibm.com/watson/developercloud/document-conversio...](https://www.ibm.com/watson/developercloud/document-conversion.html)

It's not free, and it's not popular, but it's reliable.

~~~
sochix
Good one, will look into it.

------
zmix
I use 'poppler' or Apache's PDFBox for text extraction from PDF. They both can
write HTML or their own XML format. In addition, they keep the absolute
positioning of the layout.

For XML files, there is XSLT. A simple run with the default template will
give you all strings in the document. If you really want just the paragraph
text, you will need to find or write an XSL transform.
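
To illustrate the difference between "all strings" and "just the paragraph text": the sketch below is not XSLT (Python's standard library has no XSLT processor), but `itertext()` walks the text nodes in much the same way the built-in XSLT template rules do. The `p` tag name and the sample document are assumptions for the example, and namespaces are ignored.

```python
import xml.etree.ElementTree as ET
from typing import List


def all_text(xml_source: str) -> str:
    """Concatenate every text node, like a run with only the default templates."""
    root = ET.fromstring(xml_source)
    return "".join(root.itertext())


def paragraph_text(xml_source: str, tag: str = "p") -> List[str]:
    """Return the text of each paragraph element only (tag name is an assumption)."""
    root = ET.fromstring(xml_source)
    return ["".join(p.itertext()).strip() for p in root.iter(tag)]


sample = "<doc><title>Hi</title><p>One <b>two</b></p><p>Three</p></doc>"
print(all_text(sample))        # title and paragraphs run together
print(paragraph_text(sample))  # one string per <p> element
```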

None of these is ready to go, but they are very close to it, especially
poppler and PDFBox.

------
fpd4444
Thanks guys. But none of your suggestions solves the whole problem (some don't
include OCR, some support only a limited set of file types, and so on). I'd
like to have a black box that does everything for me (does OCR if needed,
extracts PDFs, DOCs, TXTs and others).

But I'm afraid there's no such solution...

------
rakoo
I'm afraid I may be late to the party, but I've seen
[https://github.com/openpaperwork/paperwork](https://github.com/openpaperwork/paperwork)
before and it looked like a good solution for this. Never tried though.

~~~
sochix
It's not an enterprise-class solution ;(

~~~
rakoo
What makes an enterprise-class solution?

------
derwiki
We've been using Google Cloud Vision's OCR service with pretty good accuracy
(varies from ~80-99%).

[https://cloud.google.com/vision/docs/](https://cloud.google.com/vision/docs/)

------
hbcondo714
We've used Aspose for manipulating PDFs, but it works with "over 100 file
formats". It offers both SDKs and RESTful APIs.

[https://www.aspose.com](https://www.aspose.com)

------
sochix
Maybe this [https://rawtext.ambar.cloud/](https://rawtext.ambar.cloud/) ?

------
assafmo
Tika, pdftotext, lynx (html), tesseract (ocr)
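
A rough sketch of how those four tools could be glued together, dispatching on file extension. It assumes `pdftotext`, `lynx`, `tesseract` and `java` are on the PATH; `TIKA_APP_JAR` is a hypothetical path to a downloaded tika-app jar, and the extension lists are just a starting point.

```python
import subprocess
from pathlib import Path
from typing import List

# Hypothetical location of the tika-app jar; adjust to your setup.
TIKA_APP_JAR = "tika-app.jar"

IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def build_command(path: str) -> List[str]:
    """Pick a text-extraction command line based on the file extension."""
    ext = Path(path).suffix.lower()
    if ext == ".pdf":
        # "-" sends the extracted text to stdout.
        return ["pdftotext", "-layout", path, "-"]
    if ext in {".html", ".htm"}:
        return ["lynx", "-dump", "-nolist", path]
    if ext in IMAGE_EXTENSIONS:
        # "stdout" as the output base makes tesseract print the text.
        return ["tesseract", path, "stdout"]
    # Everything else (docx, pptx, ...) goes to Tika.
    return ["java", "-jar", TIKA_APP_JAR, "--text", path]


def extract_text(path: str) -> str:
    """Run the chosen tool and return whatever it prints."""
    result = subprocess.run(
        build_command(path), capture_output=True, text=True, check=True
    )
    return result.stdout
```

This does not handle scanned PDFs: `pdftotext` will return little or no text for those, so a real pipeline would need to detect that case and render the pages to images for OCR.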

~~~
sochix
It's very difficult to combine them, as documents of different types have a
lot of edge cases.

~~~
assafmo
Yeah, it's a best-effort deal, but a pretty reliable one (except maybe
tesseract-ocr).

