More than just wrapping the OCR, there is also a document reconstruction / cleaning pipeline that takes care of reading order, heading detection and classification, table detection and reconstruction, etc., so that you get Text / JSON output that is as clean and usable as possible.
This one packages an off-the-shelf version into a Docker image and starts a GUI website locally. Looking forward to using this more!
A comprehensive competitor comparison, along with outputs, is available at https://extracttable.com/compare.html
Edit: I mean as a part of a custom solution
We also support pdf.js as an alternative to pdfminer.
For a quick test, you can either run the Jupyter notebook
or run the Docker image with the UI and just drag and drop documents / play with the configuration:
docker run -t -p 8080:80 axarev/parsr-ui-localhost:latest
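If you script this, it can help to wait until the UI actually answers before opening it. Below is a minimal sketch using only the Python standard library; it assumes the UI ends up reachable at http://localhost:8080 (taken from the `-p 8080:80` mapping in the command above), and `wait_for_http` is a hypothetical helper name, not part of Parsr.

```python
import time
import urllib.error
import urllib.request


def wait_for_http(url: str, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll `url` until it returns any HTTP response or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval):
                return True  # got a 2xx/3xx response, the server is up
        except urllib.error.HTTPError:
            # The server answered with a 4xx/5xx status, so it is still reachable.
            return True
        except (urllib.error.URLError, OSError):
            # Connection refused or timed out: the container is not ready yet.
            time.sleep(interval)
    return False
```

Calling `wait_for_http("http://localhost:8080")` after starting the container returns `True` once something responds on that port, or `False` if nothing responds within the timeout.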
We at extracttable.com (we extract tabular data from images and PDFs over an API) are interested in contributing and integrating the service into the bundle.
Why this over Tika?
They don't even seem to be using Tika under the hood as one of the bundled tools. Does anyone have any comparisons?