
Extracting Tables from PDFs in Javascript with PDF.js - garysieling
http://www.garysieling.com/blog/extracting-tables-from-pdfs-in-javascript-with-pdf-js
======
unmei
Very nice. I've been doing some table extraction from PDFs recently. Also
check out PDF2JSON for Node.js-based parsing. It grabs all the text along with
its positions and dumps everything out as JSON, so you don't have to
'intercept' draw calls.
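
To illustrate why having text plus positions makes table extraction easier,
here is a minimal sketch that groups positioned text items into rows and
columns. The input shape (`x`, `y`, `text`), the function names and the
`yTolerance` parameter are all assumptions made for this example; they are not
the actual output format or API of PDF2JSON or PDF.js.

```javascript
// Hypothetical input: text items with page coordinates.
// This is NOT any library's real output format, just an illustration.
function groupIntoRows(items, yTolerance = 2) {
  // Sort top to bottom, then left to right.
  const sorted = [...items].sort((a, b) => a.y - b.y || a.x - b.x);
  const rows = [];
  for (const item of sorted) {
    const last = rows[rows.length - 1];
    // Items whose y is within the tolerance of the row's first item share a row.
    if (last && Math.abs(last.y - item.y) <= yTolerance) {
      last.cells.push(item);
    } else {
      rows.push({ y: item.y, cells: [item] });
    }
  }
  // Within each row, order cells by x and keep only the text.
  return rows.map(row =>
    row.cells.sort((a, b) => a.x - b.x).map(cell => cell.text)
  );
}

// Quote every cell and double any embedded quotes.
function toCsv(rows) {
  return rows
    .map(r => r.map(cell => `"${String(cell).replace(/"/g, '""')}"`).join(","))
    .join("\n");
}

const items = [
  { x: 10, y: 100.4, text: "Name" },
  { x: 80, y: 100, text: "Qty" },
  { x: 80, y: 115.2, text: "3" },
  { x: 10, y: 115, text: "Widget" },
];

console.log(toCsv(groupIntoRows(items)));
```

A fixed tolerance like this only works for simple, evenly spaced tables; merged
cells, wrapped text and empty cells would need more careful handling.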

~~~
garysieling
Thanks. I looked into that recently. It does make this a lot easier, so I now
have a Node version of this as well.

------
gregwebs
I thought this was discussed on HN before, but I only found this link:
[https://news.ycombinator.com/item?id=6083051](https://news.ycombinator.com/item?id=6083051)

Kudos to Gary for packaging this up:
[https://github.com/garysieling/pdf-js-csv](https://github.com/garysieling/pdf-js-csv)

Of course it has issues extracting data from many tables. There is a body of
research literature on how to automatically extract tabular data from PDF (and
other sources) and it is not considered an easy task.

You can always fall back to a manual tool like Tabula. They also have automatic
table detection now, but last I checked it only worked on certain kinds of
tables.

I write the PDF table extraction code for docmunch.com. We think we have
figured out how to achieve a very high degree of accuracy in PDF table
extraction and how to make a nice UI for manual intervention. We would love to
hear about your table extraction use cases.

------
mkl
I've done similar (but more single-use) things to extract text from PDFs, and
data from PDF and PostScript plots. PDFs are actually surprisingly easy to dig
into when they're decompressed (e.g. with pdftk), since they're mostly
text-based.
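
As a rough illustration of how readable a decompressed page can be, the sketch
below pulls literal strings out of `Tj` (show text) operators with a regular
expression. It assumes the content stream has already been decompressed (for
example with `pdftk in.pdf output out.pdf uncompress`) and is available as a
string. The function name and the sample stream are made up for this example,
and the regex ignores `TJ` arrays, hex strings, nested unescaped parentheses
and font encodings, so treat it as a toy rather than a parser.

```javascript
// Assumption: `stream` is the text of an already-decompressed PDF content stream.
// Handles only simple "(literal string) Tj" operators.
function extractTjStrings(stream) {
  // A parenthesised literal (allowing backslash escapes) followed by Tj.
  const pattern = /\(((?:\\.|[^\\()])*)\)\s*Tj/g;
  const out = [];
  let m;
  while ((m = pattern.exec(stream)) !== null) {
    // Unescape \( \) and \\ ; other escapes are left untouched in this sketch.
    out.push(m[1].replace(/\\([()\\])/g, "$1"));
  }
  return out;
}

// Hand-written sample stream, not taken from a real document.
const sample =
  "BT /F1 12 Tf 72 700 Td (Total \\(USD\\)) Tj 0 -14 Td (42.00) Tj ET";

console.log(extractTjStrings(sample));
```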

------
trez
You can also use pdf2html with the option -x (to get XML). You would then also
have the position of each text token.

------
briankim
Pretty cool, thank you for sharing.

