This one packages an off-the-shelf version into a Docker image and starts a GUI website locally. Looking forward to using this more!
More than just wrapping the OCR, there is also a document reconstruction / cleaning pipeline that takes care of reading order, heading detection and classification, table detection and reconstruction, ... so that you get Text / JSON output that is as clean and usable as possible.
A comprehensive competitor comparison, along with outputs, is available at https://extracttable.com/compare.html
Edit: I mean as a part of a custom solution
We also support pdf.js as an alternative to pdfminer.
For a quick test you can either run the Jupyter notebook
or run the Docker image with the UI and just drag and drop documents / play with the configuration:
docker run -t -p 8080:80 axarev/parsr-ui-localhost:latest
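If you script this, it can help to wait until the UI actually answers before opening the browser. Here is a minimal sketch in Python, assuming the container from the command above is running and that the UI is therefore reachable at http://localhost:8080 (taken from the `-p 8080:80` mapping). The helper name `wait_for_ui` and the timeout values are made up for illustration; I don't know how long the container really takes to start.

```python
import http.client
import time
import urllib.error
import urllib.request


def wait_for_ui(url="http://localhost:8080", timeout=60.0, interval=2.0):
    """Poll `url` until it returns any HTTP response or `timeout` seconds pass.

    Hypothetical helper: the default URL assumes the `-p 8080:80` port mapping.
    Returns True if the server answered, False if it never did.
    """
    # Ignore any system proxy settings, since the target is a local container.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    deadline = time.monotonic() + timeout
    while True:
        try:
            with opener.open(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            # The server answered with a non-2xx status, so it is up.
            return True
        except (OSError, http.client.HTTPException):
            # Connection refused, reset, or timed out: not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_for_ui()` after starting the container returns `True` once the page responds, or `False` if nothing answers within the timeout.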
Why this over Tika?
They don't even seem to be using Tika under the hood as one of the bundled tools. Perhaps someone has some comparisons?
We at extracttable.com (we extract tabular data from images and PDFs over an API) are interested in contributing and integrating the service into the bundle.