
A Python Library to extract tabular data from PDFs - leenasoni99
https://blog.socialcops.com/technology/engineering/camelot-python-library-pdf-data/
======
aidos
Cool! That's a good intro too.

Many people don't realise the weird disconnect in PDFs between the real
content and what you see on the screen, which makes it hard to recover the
source data. In extreme cases you have subset fonts with glyphs ordered
completely differently from how they are in the original, and no mapping back
to the characters they represent. Then the graphics stream is just
instructions to draw glyphs at coordinates. As you can imagine, it's quite a
battle to get back to something "raw" (assuming you even had fonts to start
with).

~~~
aasasd
If PDF is just "characters at coordinates", then getting the data out seems
to require all the functions of an OCR engine apart from character
recognition per se (namely, layout detection). And with botched fonts, you
essentially need the full package.

I so much want to see the day when PDF is dead like Flash.

~~~
pwg
> all functions of an OCR engine outside of character recognition per se

Actually, depending upon how 'obfuscated' an author was attempting to be, you
might need that OCR engine itself.

PDF allows for defining arbitrary mappings from byte values to font glyphs. So
one could define byte value 32 (decimal, usually ASCII space) to actually map
to printing, say, a capital letter Z instead. One is supposed to provide a
reverse mapping table when one does this that says "a decimal 32 byte prints a
capital letter Z" to allow for search and extraction purposes. But the PDF
spec does not require this reverse table to be present.

So it is quite possible to randomly assign font glyphs to arbitrary byte
values, and omit the reverse mapping table. The result would be that
extracting data back out of that PDF results in garbage if one does not know
beforehand what the mapping from byte value to glyph was.

So, if a 'bad actor' did this, one's only recourse for retrieving the data
would be to rasterize the PDF to a bitmap, then OCR the resulting bitmap to
extract the content back out.

~~~
nuclx
Or perform frequency analysis on the simple substitution cipher. Seriously
though, we need a document format with easier-to-extract payloads. Like Office
documents with stronger structure, an underlying schema, along the lines of
react-json-schema-form for Word.
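On text of any length, that frequency-analysis idea really does get you started against such a "cipher". A minimal sketch (the function name is mine), recovering only the space character, which dominates English text:

```python
from collections import Counter

def recover_spaces(ciphertext: str) -> str:
    """In typical English text the most frequent symbol is the space, so
    mapping the most common ciphertext symbol back to ' ' usually recovers
    the word boundaries of a simple substitution cipher."""
    most_common = Counter(ciphertext).most_common(1)[0][0]
    return ciphertext.replace(most_common, " ")
```

Recovering the remaining letters works the same way, ranking ciphertext symbols against known English letter frequencies, though it needs considerably more text to be reliable.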

~~~
emj
ODF - Open Document Format. It is not perfect, but a lot better than
Microsoft's formats.

------
squaresmile
This sounds neat. Thanks for the work, vortex_ape and others. When I last
needed this, I used tabula via tabula-py. I tried Camelot on the PDF [1] I
worked on, and unfortunately the default options returned a less workable
dataframe than tabula-py. I think it's just the area detection of Stream, and
you are working on that anyway, so I'm really looking forward to seeing the
results.

btw, I think the pip install requirements miss opencv-python (on Windows?).
And in this doc [2], it should be "top left and bottom right" instead of
"left-top and right-bottom".

[1]
[https://www.boj.or.jp/en/statistics/set/kess/release/2018/ke...](https://www.boj.or.jp/en/statistics/set/kess/release/2018/kess1808.pdf)

[2] [https://camelot-
py.readthedocs.io/en/master/user/advanced.ht...](https://camelot-
py.readthedocs.io/en/master/user/advanced.html#specify-table-areas)

~~~
vortex_ape
Hey squaresmile! Yes, right now table detection with Stream doesn't work
well when the table doesn't span the full page; for that case you can use the
table_area kwarg from [2].

You should use "pip install camelot-py[all]" to install Camelot (which will
install opencv-python too). I had to take it out of the requirements since it
wasn't available in any conda channels while I was creating the conda package.
I'm looking to remove opencv as a requirement altogether by either vendorizing
the opencv code that is being used inside Camelot or reimplementing the code
using something lightweight like pillow.

Thanks for the catch in [2], I'll correct it!

------
sandGorgon
Quick suggestion - you should integrate the functions to extract signature
data inside PDF. This is a huge issue and everyone has to write their own.

for example, this is my sample piece of code to extract data from Aadhaar
signed PDF [https://pastebin.com/dg8p98T1](https://pastebin.com/dg8p98T1)

~~~
vortex_ape
Thanks for the suggestion sandGorgon! Can you also point me to an example of a
PDF with signature data?

~~~
sandGorgon
unfortunately i cannot share without running afoul of all the laws out there.
but you can create your own here -
[https://app.digio.in/#/authenticate](https://app.digio.in/#/authenticate)

~~~
vortex_ape
Ah sorry, I forgot that would mean posting PII data online. Thanks for the link!

------
ppod
This is a really good example of how to briefly introduce/sell a library. What
it does, why, how, how to install it, with concrete examples.

~~~
vortex_ape
The API and docs (which the blog post built upon) were inspired by pandas
and requests!

------
KhalilK
A few months ago I was looking for a similar solution but couldn't find one
that handles empty cells very well. I ended up writing my own program[0] that
is specific to my files' layout.

This library works perfectly and could've saved me a lot of time! Looking at
some of the source code, we used similar logic to parse the tables. Pretty
neat!

[0] [https://github.com/khllkcm/pdf2calendar](https://github.com/khllkcm/pdf2calendar)

~~~
vortex_ape
Will check out pdf2calendar!

------
decasteve
This is nice. I do quite a bit of tabular data extraction and pdf tables are
often a sticking point. It is absolutely correct in describing it as a "fuzzy"
problem.

My go-to solution has been 'pdftotext -layout' with a bit of hackery before
giving it to pandas.read_fwf. That usually gets me 80% of the way there 80% of
the time. The upside is that this tends to fail "better" than some other
options.

I look forward to kicking the tires with this on my test cases.
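For reference, the `pdftotext -layout` into `read_fwf` pipeline described above looks roughly like this. It's a sketch under two assumptions: the Poppler `pdftotext` binary is on PATH, and the function names are mine. Real PDFs usually also need the "bit of hackery" to trim non-table lines before parsing.

```python
import io
import subprocess

import pandas as pd

def parse_layout_text(text: str) -> pd.DataFrame:
    # read_fwf infers fixed-width column boundaries from the whitespace layout
    return pd.read_fwf(io.StringIO(text))

def pdf_to_df(pdf_path: str) -> pd.DataFrame:
    # 'pdftotext -layout' keeps the page's visual column alignment in plain text
    out = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return parse_layout_text(out.stdout)
```

The appeal of this route is exactly the "fails better" property: when column inference goes wrong, you still get readable text you can fix by passing explicit `colspecs` to `read_fwf`.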

~~~
vortex_ape
Do submit bugs on GitHub if you face any issues!
[https://github.com/socialcopsdev/camelot](https://github.com/socialcopsdev/camelot)

------
danimolina
This is very interesting software. In the research community, many results
are still only available as PDF tables in papers, so getting them into a
dataframe is very useful. Good job! By the way, I would also like to be able
to export to Excel files from the command line.

~~~
vortex_ape
Hey danimolina! You can export the data into an Excel file by specifying it
as the export format; Camelot comes with a command-line interface too!
[https://camelot-
py.readthedocs.io/en/master/user/cli.html#cl...](https://camelot-
py.readthedocs.io/en/master/user/cli.html#cli)

You can simply do: camelot --output data.xlsx --format excel lattice input.pdf
(lattice can be replaced with stream based on the type of tables in your PDF)

------
plaidfuji
> However, OpenCV’s Hough Line Transform returned only line equations.

Did you try HoughLinesP?
[https://docs.opencv.org/2.4/modules/imgproc/doc/feature_dete...](https://docs.opencv.org/2.4/modules/imgproc/doc/feature_detection.html?highlight=houghlinesp#houghlinesp)

Returns line segment endpoints with a probabilistic Hough Transform. I'm fully
confident your solution works, just wondering if you tried this and why it was
rejected.

~~~
vortex_ape
Hi plaidfuji! I did try HoughLinesP during experimentation. I vaguely remember
(since this was almost 2 years back) getting the actual line segment as a
combination of multiple smaller line segments in all cases (which could then
be combined to form the actual segment using some heuristic). It came down to
getting the actual table line segment out, which a combination of
morphological transformations and cv2.findContours provided (without the need
for another combining step).
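The morphological trick mentioned above is essentially an "opening" with a long, thin structuring element: erode so only pixels inside a long horizontal run survive, then dilate to restore the run's extent. OpenCV does this efficiently with cv2.erode/cv2.dilate; here's a dependency-light numpy sketch of the idea (toy code with a made-up function name, not Camelot's actual implementation):

```python
import numpy as np

def keep_horizontal_lines(binary: np.ndarray, min_len: int) -> np.ndarray:
    """Morphological opening with a 1 x min_len kernel: only horizontal runs
    of at least min_len set pixels survive, with their original extent."""
    h, w = binary.shape
    eroded = np.zeros_like(binary)
    for y in range(h):                      # erosion: mark each start of a run of length >= min_len
        for x in range(w - min_len + 1):
            if binary[y, x:x + min_len].all():
                eroded[y, x] = 1
    opened = np.zeros_like(binary)
    for y in range(h):                      # dilation: grow each surviving start back to full width
        for x in range(w):
            if eroded[y, max(0, x - min_len + 1):x + 1].any():
                opened[y, x] = 1
    return opened
```

Running the same opening on the transposed image yields the vertical rulings; intersecting the two maps gives candidate table cells, which is where a findContours-style step comes in.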

~~~
plaidfuji
Interesting. I noticed you mentioned below that you're trying to get rid of
OpenCV as a dependency - that's really tough. I came from a Matlab background
where image processing was really well-packaged and Python is a total mess.

If you managed to vendor a small portion of OpenCV that contained image i/o,
basic colorspace conversion, thresholding, scaling/rotating, shape
drawing/insertion, HoughLines and findContours, I think you could release that
as its own package and it would be quite popular. OpenCV is such a bloated
dependency...

~~~
jononor
scikit-image contains Hough transforms and the other things you mention?
Though it does depend on scipy and matplotlib, which are kinda big.

------
jacquesm
Oh that is so timely. I've been putting that part of a pipeline I built off
for a while due to the complexity and now I can just plug this in. Super neat.
Thank you very much!

~~~
vortex_ape
What does this pipeline do and what software have you used to implement it?

I have used Airflow in the past to create ETL pipelines, and plugged in
Camelot in one of them to extract tables from PDFs. I also wrote a blog post
about it in case you might be interested. [https://hackernoon.com/how-to-
create-a-workflow-in-apache-ai...](https://hackernoon.com/how-to-create-a-
workflow-in-apache-airflow-to-track-disease-outbreaks-in-india-fd145575efa4)

~~~
jacquesm
Compress scientific papers.

Thank you for the pointers!

------
radarsat1
> When using Stream, tables aren’t autodetected. Stream treats the whole page
> as a single table

I've often wondered if image semantic segmentation methods as used in the ML
community could successfully identify things like "there is a table (or
figure) here, it's not part of the main text". I mean, it seems that humans
should be able to do this even without reading the text so I don't see why a
CNN couldn't.

~~~
worldexplorer
Yes it should work. Definitely worth trying.

------
sdiepend
An alternative:
[https://github.com/jsvine/pdfplumber](https://github.com/jsvine/pdfplumber)

~~~
suba_selvandran
Yeah, pdfplumber is also good for digital PDFs. Curious to know the advantages
of Camelot over it!

------
JDWolf
So many times I have wanted to get this type of data. Visa would send
reporting this way and it would have to be manually copied over. They offered
CSV but there were extra charges associated. There were some pretty good
libraries for paragraph text extraction but the graphs were too tough to deal
with.

------
berti
I hope this is good at extracting register maps from datasheets. That would
save a lot of tedious driver work.

~~~
vortex_ape
Hi berti! I wrote the library and the blog post. Can you point me to some PDFs
which have these register maps?

~~~
berti
Try these: Page 233
[http://ww1.microchip.com/downloads/en/DeviceDoc/Atmel-8351-M...](http://ww1.microchip.com/downloads/en/DeviceDoc/Atmel-8351-MCU_Wireless-
AT86RF233_Datasheet.pdf)

Page 45 [https://ae-
bst.resource.bosch.com/media/_tech/media/datashee...](https://ae-
bst.resource.bosch.com/media/_tech/media/datasheets/bst-
bmi160-ds000-07.pdf#page45)

Is the library able to handle cells that span multiple columns?

~~~
vortex_ape
I assumed that you're talking about page 33 in the first PDF, since it has
only 225 pages. I extracted Figure 6-23 from it and the table on page 45 in
the second PDF. Here's a gist: [https://gist.github.com/vinayak-
mehta/cf30a5560f1b8ab4c0b25e...](https://gist.github.com/vinayak-
mehta/cf30a5560f1b8ab4c0b25e34e5c6121b)

Yes, Camelot takes care of cells spanning multiple columns! You can check out
the Advanced Usage section for explanation on the keyword arguments I used in
the gist! [https://camelot-
py.readthedocs.io/en/master/user/advanced.ht...](https://camelot-
py.readthedocs.io/en/master/user/advanced.html)

~~~
vortex_ape
Note: I had to decrypt the second PDF using qpdf since the library I'm using
to split a PDF into pages (PyPDF2) doesn't support the encryption type of that
PDF.

Did this: qpdf --decrypt input.pdf output.pdf

------
guyinthebackr0w
I'm curious why the authors didn't contribute this directly to Tabula instead.

------
nerdponx
Interesting. I've used Tabula [0] in the past with great success. I wonder how
this compares.

[0]:
[https://github.com/tabulapdf/tabula](https://github.com/tabulapdf/tabula)

~~~
chedar
They have a detailed comparison with other tools (including Tabula) in the
wiki:

[https://github.com/socialcopsdev/camelot/wiki/Comparison-
wit...](https://github.com/socialcopsdev/camelot/wiki/Comparison-with-other-
PDF-Table-Extraction-libraries-and-tools)

------
danso
Great work and write up! HN submissions about PDF extraction seem to be as
reliably popular as threads mentioning bees or bashing Mongo, which I guess
goes to show how pervasive a problem it is.

------
andrew_chris
Is there any decent tool for tabular data extraction from scanned PDFs?

~~~
kumartanmay
Hey andrew_chris, we're working on this and would be interested in helping
you. Please contact me: tanmay [at] inkredo [dot] in

------
catacombs
A Python library to do this is cool, but there's already Tabula:
[https://tabula.technology/](https://tabula.technology/)

~~~
taylorwc
They address Tabula in the post:

>The first tool that we tried was Tabula, which has nice user and command-line
interfaces, but it either worked perfectly or failed miserably. When it
failed, it was difficult to tweak the settings — such as the image
thresholding parameters, which influence table detection and can lead to a
better output.

------
johnyesberg
[https://github.com/invoice-x/invoice2data](https://github.com/invoice-x/invoice2data)
is another one.

------
burtonator
You should also check out pdf.js:

We use it in Polar:

[https://getpolarized.io/](https://getpolarized.io/)

for our PDF management.

It's a pretty robust library and it renders everything on canvas BUT you also
get the raw text in the DOM so you can play with it more as an API for
managing PDFs.

REALLY nice to be able to use web standards when working with pdf.js.

The downside is that the graphics are rendered to canvas so you're only really
getting an image.

------
amelius
Wouldn't it be easier and more generic to have an OCR solution for this task?

~~~
vortex_ape
Hey amelius! Though OCR would provide a generic solution, it would be
overkill for text-based PDFs. I'm working on getting an OCR solution up since
there's still a lot of data trapped inside scanned PDFs, as opposed to text-
based ones.

If you have any pointers in the OCR route, do suggest them here, or on this
GitHub issue!
[https://github.com/socialcopsdev/camelot/issues/101](https://github.com/socialcopsdev/camelot/issues/101)

~~~
kumartanmay
Hey vortex_ape, we're also working on extracting data trapped inside scanned
PDFs, and recently we've begun to get good results using DL algos. I am based
in Gurugram; would you like to catch up and exchange experiences?

------
fredley
Can't wait to try this out with Percollate!

~~~
vortex_ape
Doesn't Percollate save web pages as PDFs? If you have a web page with tables,
you can directly use pandas.read_html to extract them!
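pandas.read_html does that in one line (note it requires an HTML parser such as lxml or html5lib to be installed). For illustration, the core idea also fits in a small dependency-free sketch using only the stdlib's html.parser; the class name here is made up:

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Minimal sketch: collect each <tr> of an HTML table as a list of cell strings."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], None, False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()
```

In practice read_html is the better choice since it also handles colspans, nested tags, and type conversion; the sketch just shows why HTML tables are so much easier to mine than PDF ones: the row and cell structure is explicit in the markup.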

------
just_myles
I'm always skeptical of these kinds of libraries. Whenever I try to use them,
it ends up feeling like a broken promise.

~~~
kumartanmay
Ever researched why it breaks?

