
Pdftabextract – A set of tools for data mining OCR-processed PDFs - happy-go-lucky
https://github.com/WZBSocialScienceCenter/pdftabextract
======
derwiki
I did a double take; I thought I had just seen this on HN. Turns out
PDFLayoutTextStripper was on the front page a few days ago:
[https://news.ycombinator.com/item?id=13729301](https://news.ycombinator.com/item?id=13729301)

------
markovbling
Awesome! Any guidance on why I might use this rather than Tabula?

~~~
nycdatasci
Tabula works on text-based PDF documents, not on scanned content, so I assume
it's not using OCR?

------
mrdrozdov
Anyone using this yet to automatically track SotA results on machine learning
tasks?

