
Camelot: Python library that makes it easy to extract tables from PDF files - mpweiher
https://camelot-py.readthedocs.io/en/master/
======
dang
[https://news.ycombinator.com/item?id=18199708](https://news.ycombinator.com/item?id=18199708)

------
burtonator
I'm surprised pdf.js isn't discussed often as an API for reading data from PDF
files.

It allows you to get raw access to the text but also the visual rendering of
the PDF.

The only downside is that it uses the browser to do this but you could use
chrome headless to make this into an API.

I ended up building this:

[https://github.com/burtonator/pdf-annotation-exporter](https://github.com/burtonator/pdf-annotation-exporter)

based on that strategy to export PDF annotations.

It ended up evolving into a full PDF and document management system for Electron:

[https://getpolarized.io/](https://getpolarized.io/)

... but the nice thing is that even though Polar is a GUI app, I can do things
with PDF.js in the future, like uploading documents to a search infrastructure.

~~~
scrollaway
> The only downside is that it uses the browser to do this but you could use
> chrome headless to make this into an API

Huh? Doesn't pdf.js have the capability to run headless? What in the browser
does it depend on?

Anyway using that in Python is never going to be easy or useful. There's
already good PDF tooling in Python.

------
tedmiston
Camelot also received extensive discussion in this thread, which was recently
on the front page all day:

[https://news.ycombinator.com/item?id=18199708](https://news.ycombinator.com/item?id=18199708)

Disclosure: I helped review some of the Python code in the library. I'm really
excited about its applications.

------
lvh
This looks super neat, but only works on generated PDFs, not images, for now.
While it uses imaging tricks to extract lines and table-like structures, it
needs actual text (not images of text) to extract. So, there's an OCR step
missing.

Something that could take a picture of a bill or invoice and structure that
data would be pretty useful, I imagine (even better if it ran on mobile
devices). Even something that used a third party (like AWS Rekognition, whose
text bits are pretty great) and added text would help this library be useful
in more cases, without really having to add features to it.
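
To make the "structure that data" step concrete, here is a minimal sketch of
turning OCR output into table rows. It assumes each OCR'd word arrives with
the top-left corner of its bounding box; the `Word` shape, `words_to_rows`,
and `row_tolerance` are hypothetical names for illustration, not the response
format of Rekognition or anything in Camelot. It only groups words into rows
by vertical position and sorts each row left to right, so real invoices with
skewed photos or multi-word cells would need more than this.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """One OCR'd word plus the top-left corner of its box (hypothetical shape)."""
    text: str
    x: float
    y: float


def words_to_rows(words: List[Word], row_tolerance: float = 5.0) -> List[List[str]]:
    """Group words into rows by vertical position, then order each row by x."""
    rows: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w.y):
        # Stay on the current row while the word is within row_tolerance
        # of that row's first word; otherwise start a new row.
        if rows and abs(word.y - rows[-1][0].y) <= row_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [[w.text for w in sorted(row, key=lambda w: w.x)] for row in rows]


if __name__ == "__main__":
    sample = [
        Word("Total", 10, 50),
        Word("Item", 10, 10),
        Word("Price", 100, 12),
        Word("9.99", 100, 52),
    ]
    print(words_to_rows(sample))  # [['Item', 'Price'], ['Total', '9.99']]
```

The tolerance is in whatever units the OCR engine reports, so it would have to
be tuned per source (pixels, or fractions of page height).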

------
frou_dh
The Python library author community sure is getting a lot of mileage out of
this 'For Humans' marketing meme.

~~~
nerdponx
Several (requests, pipenv, maya) are by the same author.

~~~
denzil_correa
The author being Kenneth Reitz

[https://github.com/kennethreitz](https://github.com/kennethreitz)

------
chid
I wonder how this compares with Tabula
([https://tabula.technology/](https://tabula.technology/)). I did read the
section, but Tabula can also be tweaked, albeit not in the web app.

