
Tabula: Extract Tables from PDFs - yrochat
http://tabula.technology/
======
sourc3
I have been working on a side project that needs to read dynamic table layouts
and extract financial information. I was excited to hear about Tabula a few
weeks ago but I had 0 success in getting even one PDF extracted.

I ended up using the pdfquery package in Python, which relies heavily on
PDFMiner under the covers.
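
With that kind of library you typically get positioned text fragments rather
than a table, so rows have to be rebuilt from coordinates. A minimal sketch of
that step, assuming the fragments have already been extracted into hypothetical
(x, y, text) tuples (an illustrative shape, not pdfquery's or PDFMiner's actual
output type) and that a vertical tolerance of 2 units suits the document:

```python
from typing import List, Tuple

# Hypothetical fragment shape for illustration: (x, y, text).
Fragment = Tuple[float, float, str]


def group_rows(fragments: List[Fragment], tol: float = 2.0) -> List[List[str]]:
    """Cluster fragments into rows by y, then order each row left to right."""
    rows: List[List[Fragment]] = []
    # PDF coordinates grow upward, so sort by descending y to read top-down.
    for frag in sorted(fragments, key=lambda f: -f[1]):
        # Join the current row if this fragment sits within `tol` of its first cell.
        if rows and abs(rows[-1][0][1] - frag[1]) <= tol:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    # Tuples sort by x first, which gives left-to-right cell order.
    return [[text for _, _, text in sorted(row)] for row in rows]
```

Real layouts with wrapped cells or merged headers would need more than this;
the tolerance in particular is a guess that has to be tuned per document.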

Besides ABBYY's software (which is proprietary and licensed), does anyone have other
recommendations?

~~~
baldfat
I can't help but say I refuse to work with PDF files. I will email and do a
ton of meetings and one on ones to explain that PDF is a container and that
the format inside the container is the battle. Just give me the plain format,
and if it costs the company money, it is worth it.

~~~
leejoramo
Much of the use of these tools is to extract data from government or corporate
sources that, while required to publish the information, may not want to make
it easy to access. Thus they prefer PDFs.

Those of us trying to extract the data bound up in these PDFs do advocate to
get access to the original data, but we have to deal with what we have today.

~~~
baldfat
And this is not good for anyone and is the opposite of the spirit behind the
Sunshine Laws.

My school district (what a mess) publishes images (horrible, bad images) of all
the school notes, including all financial information and spreadsheets. One
night I had to spend 4 hours manually typing in the year's budget just to check
on our spending per student. It was $5,400, the lowest in our state.

------
tvanantwerp
Congrats on 1.0! We've been using Tabula in the office to get data, usually
from government sources, out of PDFs. It's been very handy--though I don't
especially love having Java on interns' PCs to use it. But it's worth the
tradeoff to not waste their--and our--time manually extracting that data.

------
bradleyland
Congrats on the 1.0 release guys! We've been using Tabula since the days
before the app packaging. It's been really cool to observe development
progress, and especially to see you guys tackle the problem of distributing as
an application.

------
norea-armozel
If I had this when I was working on extracting the ISIR data fields in the
Department of Education's documentation it would've saved me time. Bleh, it's
a shame it didn't exist then. :(

------
dakotaw
This is positively phenomenal, and the UI is great for non-technical users.
Super, super tool. Thanks so much for developing it and opening it up to the
public!

------
mud_dauber
It bombed on the very first PDF I fed to it. (Admittedly, a technical
datasheet of ~50 pages.)

~~~
jazzido
Hi. Can you share the PDF with us on our issue tracker?
([https://github.com/tabulapdf/tabula/issues](https://github.com/tabulapdf/tabula/issues))
We'd be happy to take a look at it.

------
plicense
How does it read data from the PDF? Is there a PDF parser somewhere down
inside the code?

~~~
_delirium
It embeds the free-software version of JPedal:
[https://github.com/tabulapdf/tabula/tree/master/lib/jars](https://github.com/tabulapdf/tabula/tree/master/lib/jars)

Unfortunately it looks like the developers of JPedal decided to discontinue
the LGPL version and focus on the proprietary version, so it's unmaintained
unless someone else picks up development.

~~~
jazzido
Hi. Tabula author here.

We use JPedal for rendering pages as images. For parsing, we use Apache
PDFBox. In the near future, we plan to render the PDFs client side with
Mozilla's PDF.js

~~~
jahewson
It's worth mentioning that PDFBox 2.0 does a great job of rendering PDFs too.

~~~
jazzido
PDFBox 1.8's less-than-great rendering engine forced us to include a separate
library for that purpose only.

Moving to PDFBox 2.0 is also on our roadmap. But the text extraction API in
2.0 has changed a lot too, so porting our engine would require quite a bit of
effort.

Friendly reminder: we're an MIT-licensed open source project, and we're always
open to contributions!

------
kaitai
I used this pretty heavily in May for a recent data-science project and it
really saved my butt. Easy to use, sped things up a lot. Choked on only one
pdf. Looking forward to seeing the progress made.

------
comrh
I have a bunch of scanned PDFs from an open data request I'm looking forward
to trying when I'm home. My own solution with pytesser was pretty effective
but required a ton of tweaking.

~~~
jessedhillon
I don't think it's going to be able to help you if they're scans. From the
README:

> _Tabula only works on text-based PDFs, not scanned documents. If you can
> click-and-drag to select text in your table in a PDF viewer (even if the
> output is disorganized trash), then your PDF is text-based and Tabula should
> work._

~~~
comrh
Ah thanks, missed that. They gave me half text-based and half scans, gotta
love the government.
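
For a mixed batch like that, one way to triage is to check how much text each
page yields before deciding where it goes. A minimal sketch, assuming the
per-page text has already been pulled out with whatever extractor is at hand;
`split_pages` and the 20-character threshold are hypothetical choices for
illustration, not anything from Tabula:

```python
from typing import Dict, List


def split_pages(page_texts: List[str], min_chars: int = 20) -> Dict[str, List[int]]:
    """Sort 1-based page numbers into text-based vs. likely scanned."""
    result: Dict[str, List[int]] = {"text": [], "scanned": []}
    for number, text in enumerate(page_texts, start=1):
        # Count only non-whitespace characters; a scan usually yields none.
        visible = sum(1 for ch in text if not ch.isspace())
        result["text" if visible >= min_chars else "scanned"].append(number)
    return result
```

Pages that land in "text" are candidates for Tabula; the rest would still need
OCR. A scan with an embedded OCR layer would pass this check, so it is only a
first filter.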

------
sebastianavina
Great project, extracting tables from PDFs is a task I need to do so damn
frequently.

