
Extracting campaign finance data from gnarly PDFs using deep learning - danso
http://jonathanstray.com/extracting-campaign-finance-data-from-gnarly-pdfs-using-deep-learning
======
lalaland1125
My personal bet is that you would probably have gotten better results by
simply feeding your data into xgboost or lightgbm.

This current fad of focusing on neural networks just seems sorta silly when we
have simpler and (often) better-performing models. Just look at what sorts of
models win Kaggle competitions.
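
For illustration, a rough sketch of what that swap could look like, assuming per-token features along the lines the post describes (the feature set here is made up):

```python
# Hypothetical sketch: gradient boosting over the same per-token features
# a network would see. The six feature columns are stand-ins (e.g. token
# x/y position, width, height, dollar-sign flag, repetition count).
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.random((1000, 6))            # one row per token
y = rng.integers(0, 2, 1000)         # 1 = this token is the total amount

model = xgb.XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1)
model.fit(X, y)

# At inference time, score every token in a document and take the argmax,
# just like the network picks its highest-scoring candidate.
scores = model.predict_proba(X[:50])[:, 1]   # tokens of one document
best_token = int(np.argmax(scores))
```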

~~~
steve19
Maybe, but I think 90% of the work is wrangling the data and understanding the
problem; it wouldn't be that hard to substitute gradient boosting for the NN.

------
sweeneyrod
Deep learning (and indeed any kind of machine learning) is really not
necessary here. I wrote a very simple baseline that estimates the total gross
amount as the biggest numerical value present (favouring values that begin
with a dollar sign and/or occur more than once). This achieves 93% accuracy,
compared to 90% for the deep learning model; the code is here:
[https://gist.github.com/rlmacsween/166b7c1c1b0c5f466a0fe9b46...](https://gist.github.com/rlmacsween/166b7c1c1b0c5f466a0fe9b467d61537)
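
Roughly, the heuristic is this (a re-sketch of the idea; see the gist for the real code):

```python
# Re-sketch of the baseline: among numeric tokens, prefer ones with a
# leading dollar sign, then ones that occur more than once, then the
# largest value.
import re
from collections import Counter

def guess_total(tokens):
    counts = Counter(tokens)
    best_key, best_value = None, None
    for tok in counts:
        m = re.fullmatch(r"\$?([\d,]+(?:\.\d+)?)", tok)
        if not m:
            continue
        value = float(m.group(1).replace(",", ""))
        key = (tok.startswith("$"), counts[tok] > 1, value)
        if best_key is None or key > best_key:
            best_key, best_value = key, value
    return best_value

print(guess_total(["Gross", "$1,250.00", "$1,250.00", "500.00"]))  # 1250.0
```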

~~~
jonathanstray
Nice! An excellent baseline, something to beat.

------
ChuckMcM
This is a problem that fascinates me as well: how humans can look at a
document with tables, columns, etc. and effortlessly extract all of the
'factoid' bits, and often their relationships too. Back at IBM there was some
work on extracting tables from PDFs, but it turns out to be a pretty
challenging problem to generalize.

------
manojlds
Looks like a very shallow article that hasn't actually solved the meat of the
problem. Am I reading it wrong?

~~~
jonathanstray
Heh. It’s my work, so maybe I can clarify the goals. This is meant to be a
proof of concept. I took a week and was able to show that relatively simple
deep learning techniques are capable of generalizing over unseen form types
with high accuracy. I also showed that tokens-plus-geometry is a viable
format, and that hand-crafted feature engineering is still necessary (and
still used in SOTA approaches).
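
For concreteness, a simplified sketch of what one tokens-plus-geometry record looks like (the actual feature set has more to it):

```python
# Simplified sketch of the tokens-plus-geometry input: each token carries
# its text plus its bounding box on the page.
from dataclasses import dataclass

@dataclass
class Token:
    text: str     # e.g. "$1,250.00"
    x0: float     # left edge in PDF points
    top: float    # top edge
    x1: float     # right edge
    bottom: float
    page: int

# One training example: the document's token sequence plus the index of
# the token that holds the ground-truth total.
doc = [Token("TOTAL", 36.0, 700.2, 68.5, 710.0, 0),
       Token("$1,250.00", 420.0, 700.2, 478.3, 710.0, 0)]
target_index = 1
```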

I also believe that preparing and cleaning this data set, and bringing a
challenging investigative journalism problem to the attention of other
researchers, would be valuable even if I hadn’t done any work on this baseline
solution. This is a problem that journalists currently expend a huge amount of
time and money on, which reduces the effectiveness of transparency around
political ad spending information.

~~~
coffeecat
I'm curious whether you considered or attempted to use pdfplumber's table
extraction methods to separate tabular from non-tabular text. That would be my
starting point on a problem like this, as picking the relevant rows of a table
is far easier than picking from the set of all tokens. By the way, when you
say tokens, are you referring to runs of non-whitespace characters separated
by whitespace? How reliable have you found pdfplumber to be at picking out
words/tokens?
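
(For reference, the two pdfplumber calls I mean; "invoice.pdf" is just a placeholder:)

```python
# The two pdfplumber entry points in question. extract_words() groups
# characters into whitespace-separated words with bounding boxes, which
# is the tokenization being discussed; extract_tables() returns detected
# tables as lists of rows.
import pdfplumber

with pdfplumber.open("invoice.pdf") as pdf:
    page = pdf.pages[0]
    words = page.extract_words()    # [{'text': ..., 'x0': ..., 'top': ...}, ...]
    tables = page.extract_tables()  # [[row, row, ...], ...] per table
    for w in words[:5]:
        print(w["text"], w["x0"], w["top"])
```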

~~~
jonathanstray
I didn’t try separating out tables because the total field isn’t actually
“inside” the table in many cases. Certainly the other fields I want are not.

pdfplumber seems mostly OK at extracting tokens, though it sometimes combines
tokens that should be separate. I suspect a few percent of the error is
actually due to problems earlier in the data pipeline, as opposed to the model
proper.

------
ma2rten
Accuracy may not be the best metric if your dataset is imbalanced.

~~~
jonathanstray
It’s not a binary classifier. Every invoice in the dataset has a total amount
written somewhere on it. Accuracy here is whether the network chooses the
correct token from each PDF.
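
In other words, the metric is per-document token selection, something like:

```python
# Sketch of the metric: for each PDF, the model scores every token, and
# the prediction counts as correct iff the top-scoring token is the one
# labeled as the total.
import numpy as np

def token_choice_accuracy(docs):
    """docs: list of (scores, true_index) pairs, one per PDF."""
    hits = sum(int(np.argmax(scores) == true_index)
               for scores, true_index in docs)
    return hits / len(docs)

docs = [(np.array([0.1, 0.7, 0.2]), 1),   # correct pick
        (np.array([0.5, 0.3, 0.2]), 2)]   # wrong pick
print(token_choice_accuracy(docs))        # 0.5
```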

------
jonathanyc
> This project aimed to find out, and successfully extracted the easiest of
> the fields (total amount) at 90% accuracy using a relatively simple network.

Why does this page not even say what this field was? For all we know, OpenCV
would have worked better.

~~~
oaeide
As it states, the field in question is "Total amount".

------
dlphn___xyz
How does the performance compare to a simple OCR reader?

~~~
steve19
It operates on OCR output. Its job is to identify the total expenditure by
looking at that output.

