
How to Run SQL on PDF Files - vahidfazelrezai
https://rockset.com/blog/how-to-run-sql-on-pdf-files/
======
seba_dos1
For some reason, when clicking this link, I expected an SQL database running
inside a PDF file. Which, considering that PDFs can embed JavaScript and that
Emscripten exists (and that PostScript itself is Turing-complete if you want
to go hardcore), may actually be doable.

~~~
avinium
I briefly looked at running JavaScript inside PDFs a couple of years ago - I
think this is possible under the spec, but no major renderers allow it.

Could be wrong.

~~~
DanielDent
PDFs that work in Chrome's PDF renderer:

A calculator:

https://pspdfkit.com/images/blog/2018/how-to-program-a-calculator/calculator.pdf

A game of breakout:

https://github.com/osnr/horrifying-pdf-experiments/raw/master/breakout.pdf

The PDF specification is hundreds of pages.

~~~
WrtCdEvrydy
I wonder if running PouchDB would count... since it's an offline / online
database.

I'm wondering about a PDF that holds data and can self-update whenever
there's an internet connection.

Anyone wanna collaborate on something like this?

I wonder if having a PDF like this that showed you how your stocks of choice
are doing would be interesting.

~~~
yomly
What would be the advantage of this over a website/PWA?

That it's more portable? It's more document-like/printable?

~~~
WrtCdEvrydy
Self-contained, no need for internet unless you want it to update, and it's
forward-compatible with future PDF readers.

------
msravi
So the tool is being used to extract text, and then a regex is used to extract
the relevant fields/values. It seems that pdftotext[1][2] with awk can do the
job on your local machine without uploading your docs.

1. brew install pkg-config poppler (on Mac)

2. sudo apt-get install poppler-utils (on Debian/Ubuntu)
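The pdftotext + awk pipeline described above might look like the sketch below.
The bill layout and field names are invented for illustration; a real run would
start from `pdftotext -layout bill.pdf bill.txt` instead of the heredoc.

```shell
# Stand-in for pdftotext output (normally: pdftotext -layout bill.pdf bill.txt)
cat > bill.txt <<'EOF'
Account Number: 123-456-789
Statement Date: 2019-05-01
Total Amount Due: $84.20
EOF

# Extract one field with awk, splitting each line on ": "
awk -F': ' '/Total Amount Due/ { print $2 }' bill.txt
# prints $84.20
```

The same one-liner, pointed at real pdftotext output, covers what the posted
tool's regexes do for a single known bill format.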

~~~
dexcs
Analyzing the text is the problem. Not extracting. Are there any good open
source libs out there?

~~~
msravi
Sure. But the tool posted here doesn't do that. It merely extracts text, and
the "analysis" is a couple of regexes that are tailor-made for that particular
PDF. Awk can do that much and a lot more.

If you want to extract tables from a PDF, there's Tabula[1], but it isn't
automated to run over the whole PDF - you have to draw a manual rectangular
selection around the table you want to extract.

1. [https://github.com/tabulapdf/tabula](https://github.com/tabulapdf/tabula)

~~~
mirimir
Indeed. Many years ago, I "ran SQL" on a couple decades of Usenet newsgroup
data. Extraction and manipulation involved a bunch of grep, sed, tr and awk
(and millions of tmp files). But, as with PDFs of utility bills, the regexes
were very specific.

~~~
kwadhwa
Hey, Kshitij from Rockset here.

With Rockset you can avoid ETL when it comes to extracting and manipulating
the data. Also, the main value here is that you can join this data with other
data sets that are in JSON, CSV, XLS or Parquet formats using SQL to help in
analysis.

~~~
mirimir
Maybe you could add modules for extracting and manipulating data from popular
sources. Such as the most popular social media. Also Amazon, Craigslist, Ebay,
etc. And the main search engines.

There are _many_ people who want usable data from such sources. And your
service wouldn't be doing any scraping, so you'd probably be OK legally. But
IANAL, so do check.

------
voltagex_
https://rockset.com/pricing/ - free tier looks pretty good, but I'm not sure
I'd be comfortable uploading bills and other documents here.

~~~
WrtCdEvrydy
I'm sure they take 'your privacy and security seriously'

~~~
aasasd
Doubly so after an incident.

~~~
kkarakk
but they have [Standard encryption] and [Field masking] on the free tier!
surely that means something?

------
mehrdadn
What do you do if you don't want to upload your stuff to somebody's server?

------
aboutruby
So it extracts the metadata and converts the PDF to text and puts that
automatically in a BigQuery table (with some custom functions).

I was somehow expecting the parser to automatically recognize patterns in
the PDF and maybe try to name them (and let you rename them), kind of like
what advanced web scrapers do.

------
mLuby
If the author is reading, I'd love to see more info on how you trained the
system to understand the text content of the PDF. And how regular/structured
do the PDFs have to be to work?

~~~
sumedh
Not the author, but there is no training here: it's using PDF-to-text
conversion (there are various libraries to do that) and then a regex pattern
to extract the relevant information.

Since it's regex, only techies would be able to do that, in which case they
can just write their own script instead.
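Such a self-written script can be a couple of lines of awk. A minimal sketch,
with the field layout made up (it is not from the post):

```shell
# Stand-in for the text a PDF-to-text step would produce
cat > bill.txt <<'EOF'
Account Number: 123-456-789
Total Amount Due: $84.20
EOF

# Collapse the "Field: value" lines into a single CSV row
awk -F': ' '{ printf "%s%s", (NR > 1 ? "," : ""), $2 } END { print "" }' bill.txt
# prints 123-456-789,$84.20
```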

~~~
kwadhwa
Hey, Kshitij from Rockset here.

You are correct that Rockset is doing text extraction for PDF but the main
value here is that you can join this data with other data sets that are in
JSON, CSV, XLS or Parquet formats using SQL without doing any ETL.

------
iblaine
This is a confusing feature. Converting PDFs to text is already a trivial
technical challenge, so it's not a product differentiator. What is a product
differentiator is the converged indexing (index and store all the things) that
Rockset uses. This PDF feature seems like a distraction.

Also, I uploaded my PG&E bill to Rockset and got an empty result set...maybe
I'm using it incorrectly.

