
Introducing Tabula, a human-friendly PDF-to-CSV data extractor - mtigas
http://source.mozillaopennews.org/en-US/articles/introducing-tabula/
======
cs702
Love it: a wonderful gift to millions of students, analysts, journalists,
researchers, and others who for many years have had to extract data from PDFs
via throwaway scripts, copy-and-paste, or (yikes) read-and-retype.

------
polskibus
If they automate table detection, then many low-end "analysts" will be made
redundant. PDFs are one of the worst obstacles to data feed automation.

~~~
kyllo
Yeah, I did this for a living for a little while--I was an analyst whose job
was mostly to read industry quarterly reports in PDF form and condense them
into much smaller reports to give to upper management.

Of course, the data itself is usually also for sale. But a manager would
rather make an analyst scrape it from the PDF report than pay the reporting
company extra for a data subscription, because they prefer to bear the
opportunity cost of not having the analyst work on something more important
and productive.

As an analyst, I can't count how many times I asked my former employer to
shell out a couple hundred dollars a month for market intelligence data
subscriptions and was blown off because they didn't want to allocate a budget
for it.

~~~
polskibus
Just imagine how many "analysts" work for Reuters et al.

~~~
kyllo
A staggering number of people in any large organization are basically working
as a sort of "information filter" to simply condense information and report it
up the organizational food chain. A sufficiently clever combination of OCR,
NLP, and ML could automate a lot of those jobs. In other words, the executive
set needs a Summly for industry intelligence. (Startup idea that I'm sure
someone with VC connections has thought of already)

The trouble with PDFs is they're designed to be consumed by human eyes only.
Any attempt to automatically extract information from them is fundamentally a
hacky scrape-job.

------
danso
Great work, the integration (as shown in the demo) and UX are really well
done. A couple of questions:

1) Why use Python for OpenCV when Ruby has a decent wrapper that can do Hough
(<https://github.com/ruby-opencv/ruby-opencv>)? Or was the Ruby version just
too buggy still?

2) Is there a command-line version planned? I guess it'd be most relevant once
auto-detection is figured out.

~~~
mtigas
1) We’re not actually using Python for OpenCV, just ruby-opencv and possibly
some bindings in Java/JRuby. (I think Python’s in the build instructions due
to a numpy dependency in OpenCV. Though that _might_ be specific to using
Homebrew on OS X. Definitely looking into it soon.)

2) No plans at the moment, though that's an awesome idea.
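(The Hough transform mentioned above detects ruling lines in a rasterized page. As an illustrative, much-simplified stand-in for the same idea, a pure-Python projection profile can find horizontal rules: binarize the page, count ink pixels per row, and flag rows that are almost entirely ink. The function name and threshold below are assumptions for the sketch, not Tabula's actual code.)

```python
# Simplified sketch: find horizontal ruling lines in a binarized page
# (a 2-D list of 0/1 pixels, 1 = ink) using a row projection profile.
# A toy stand-in for the OpenCV Hough-transform approach discussed above.

def horizontal_rules(page, min_fill=0.8):
    """Return indices of rows whose ink coverage suggests a ruling line."""
    width = len(page[0])
    return [y for y, row in enumerate(page)
            if sum(row) >= min_fill * width]

# Toy 5x10 page: solid lines at rows 1 and 3, scattered text ink at row 2.
page = [[0] * 10,
        [1] * 10,
        [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
        [1] * 10,
        [0] * 10]
```

A real detector would also tolerate small gaps and skew, which is where Hough earns its keep.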

------
saddino
Wow, nice work! I'm the author of Trapeze, a once-shareware (now freeware and
open source) PDF-to-Word/RTF/HTML/PlainText application for OS X. My approach
was similar: trying to squash characters into words via a logical grid to
determine whitespace. My #1 request from customers was to extract tables and I
never had the guts to attempt it. :-)

(For those interested, you can grab Trapeze from mesadynamics.com -- requires
OS X 10.4; source code is a mixture of C++ and Objective-C).
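(The "squash characters into words" idea above can be sketched in a few lines: given characters with x-positions on one text line, merge neighbors whose horizontal gap is small, and treat a large gap as whitespace. The function name and gap threshold below are illustrative assumptions, not Trapeze's actual code.)

```python
# Hypothetical sketch of grouping characters into words by horizontal gap.
# chars: list of (char, x, width) tuples, sorted left to right.

def chars_to_words(chars, gap=3.0):
    """Merge adjacent characters into words; a gap > `gap` means a space."""
    words, current, prev_end = [], "", None
    for ch, x, w in chars:
        if prev_end is not None and x - prev_end > gap:
            words.append(current)   # large gap: close the current word
            current = ""
        current += ch
        prev_end = x + w            # right edge of this character
    if current:
        words.append(current)
    return words

# "PDF" then a wide gap, then "to".
line = [("P", 0, 5), ("D", 6, 5), ("F", 12, 5),
        ("t", 25, 4), ("o", 30, 4)]
```

A production version has to pick the gap threshold per font and size, which is where the "logical grid" comes in.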

------
xaritas
I probably could have used this recently when I had a project which required a
close encounter with extracting data from PDFs. Fortunately the PDFs were
generated as a report by a VB6 application (!) so they had a fairly regular
format once I figured out the quirks of PDF, as the authors describe here.

I did learn a few neat tricks by doing it myself though. The library I used to
extract the text was none other than Mozilla's own PDF.js, so in the final
version my users could just drag and drop the PDF onto the browser window, and
my little algorithm parsed the tables into arrays, with AngularJS rendering
them as HTML tables.

Obviously computer-vision assisted, general purpose reconstruction of tabular
data is the secret sauce in this project, but if you have the right use case
you can do some cool things in the client. You do have to dig into the PDF.js
internals a bit to figure out how to use it but I'm sure that it will improve
in that respect.
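(The table parsing described above boils down to one idea: PDF text extractors such as PDF.js yield text items with page coordinates, and grouping items whose y-positions fall within a tolerance reconstructs the rows. The item format, names, and tolerance in this sketch are illustrative assumptions, not the commenter's actual code.)

```python
# Hypothetical sketch: rebuild table rows from positioned text items.
# items: list of (text, x, y), with y increasing upward as in PDF space.

def items_to_rows(items, tol=2.0):
    """Group items into rows (top to bottom), each sorted left to right."""
    rows = []  # list of (anchor_y, [cell texts])
    for text, x, y in sorted(items, key=lambda t: (-t[2], t[1])):
        if rows and abs(rows[-1][0] - y) <= tol:
            rows[-1][1].append(text)      # same row: append cell
        else:
            rows.append((y, [text]))      # new row
    return [cells for _, cells in rows]

items = [("Q1", 10, 100), ("Q2", 50, 100), ("revenue", 0, 100),
         ("5", 10, 80), ("7", 50, 80)]
```

With regular, machine-generated PDFs (like the VB6 reports above), this kind of y-clustering is often all the "table detection" you need.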

------
manicbovine
I wish I'd read this an hour ago, before I wrote a series of terrible awk,
perl, and bash scripts to process several thousand inconsistently formatted
pdfs.

edit: Nevermind, it wouldn't have helped. I missed the part where automation
isn't yet supported. Either way, this looks like a great tool.

------
nsp
This is fantastic, it would have saved me dozens of hours as an econ
undergraduate.

Semirelated: I used to have a ton of scanned journal articles that I wanted to
be able to read on a kindle without having to scroll across every page, and
came across k2pdfopt. It's a C program that finds word and line breaks in
image-based PDFs and rearranges the text so that it'll fit on smaller screens.
It's got a ton of flags you can set and is pretty good at ignoring/cropping
out headers and footers and dealing with pages scanned at an angle.
<http://www.willus.com/k2pdfopt/help/k2menu.shtml> No affiliation with Willus.

------
migbac
I am starting a personal project to convert my university schedules from PDF
to an ICS calendar, so I'm glad I heard about Tabula, but as others have said,
a command-line version would be wonderful.

------
stcredzero
This is very cool!

Has this kind of thing been done for PDF map data?

I was talking with a friend of mine a month ago about the dismal state of
official crime incidence websites. They're usually just lists of PDFs,
probably because whoever is responsible for the data just uses whatever MS
Word PDF output is available to the office and posts an existing monthly
report as a PDF. This makes online crime data a huge pain in the #ss to
decipher.

I'm sure there's a lot of geographic data this could apply to.

------
leeoniya
this is neat. i'm also doing pdf rasterization and pretty extensive document
analysis in html5 <canvas>, not just tables. unfortunately it's for an
internal tool which will likely form the core of our business but the base
library i wrote and use for it is open sourced at
<https://github.com/leeoniya/pXY.js>

tutorial and demos are here: <http://o-0.me/pXY/> , some recent commits like
radial scanning aren't documented very well yet but i'll devote some time to
them if anyone needs those. they're mostly useful for interactive analysis.

with some creative algorithms, typed arrays and web workers the speed is
pretty amazing (for something built in js at least). a 1550x2006 pixel
document page analyzes in 1.1s in chrome.

------
alanreid
This is just awesome! Well done!

------
jonjohn84
Tabula is also the name of a programmable logic company doing fpga-like
"3PLDs" where the design implemented varies over time to increase effective
size of the logic fabric. (tabula.com)

------
bnp
Awesome - have needed this so often.

