My personal bet is that you would have probably gotten better results by simply feeding your data into xgboost or lightgbm.
This current fad of focusing on neural networks just seems sorta silly when we have simpler and (often) better performing models. Just look at what sorts of models win kaggle competitions.
Maybe, but I think 90% of the work is wrangling the data and understanding the problem; it wouldn't be that hard to substitute gradient boosting for the NN.
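For what it's worth, a rough sketch of what that substitution might look like, assuming you've already turned each candidate token into a small feature vector (the feature names and toy data here are made up, not from the project):

```python
# Hypothetical sketch: score candidate tokens with gradient boosting instead of a NN.
import lightgbm as lgb
import numpy as np

# X: one row per candidate token, e.g. [looks_like_money, x_position, y_position, token_length]
# y: 1 if the token is the true total amount, 0 otherwise
X_train = np.array([[1, 0.8, 0.9, 7],
                    [0, 0.1, 0.2, 4],
                    [1, 0.7, 0.95, 6]])
y_train = np.array([1, 0, 0])

model = lgb.LGBMClassifier(n_estimators=50, min_child_samples=1)
model.fit(X_train, y_train)

# At prediction time, pick the highest-scoring token per document.
scores = model.predict_proba(X_train)[:, 1]
best_token_index = int(np.argmax(scores))
```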
Deep learning (and indeed any kind of machine learning) is really not necessary here. I wrote a very simple baseline that estimates the total gross amount as the biggest numerical value present (favouring ones that begin with a dollar sign and/or occur more than once). This achieves 93% accuracy (code is here https://gist.github.com/rlmacsween/166b7c1c1b0c5f466a0fe9b46...) in comparison to the 90% for the deep learning model.
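The gist has the details, but the heuristic is roughly this (a loose paraphrase, not the gist's actual code):

```python
import re
from collections import Counter

def guess_total(tokens):
    """Guess the total gross amount: the largest numeric token, favouring
    ones with a leading '$' and ones that appear more than once."""
    counts = Counter(tokens)
    candidates = []
    for tok in set(tokens):
        m = re.fullmatch(r"(\$?)([\d,]+(?:\.\d{1,2})?)", tok)
        if not m:
            continue
        try:
            value = float(m.group(2).replace(",", ""))
        except ValueError:
            continue
        has_dollar = m.group(1) == "$"
        repeated = counts[tok] > 1
        # Dollar sign and repetition act as bonuses, then largest value wins.
        candidates.append(((has_dollar, repeated, value), value))
    if not candidates:
        return None
    return max(candidates)[1]
```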
This is a problem that fascinates me as well: how humans can look at a document with tables, columns, etc. and effortlessly extract all of the 'factoid' bits, and often their relationships. Back at IBM there had been some work on extracting tables from PDFs, but it turns out to be a pretty challenging problem to generalize.
Heh. It’s my work, so maybe I can clarify the goals. This is meant to be a proof of concept. I took a week and was able to show that relatively simple deep learning techniques are capable of generalizing over unseen form types with high accuracy. I also showed that tokens-plus-geometry is a viable format, and that hand-crafted feature engineering is still necessary (and still used in SOTA approaches).
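For anyone wondering what "tokens-plus-geometry" means in practice: the input is roughly one record per token, something like the following (field names here are illustrative, not necessarily the ones in the repo):

```python
# Illustrative token-plus-geometry record; field names are my own, not the project's.
token_record = {
    "text": "$10,250.00",           # the token string as extracted from the PDF
    "page": 0,                      # page index
    "x0": 412.3, "x1": 471.8,       # horizontal extent of the bounding box (points)
    "top": 655.0, "bottom": 667.2,  # vertical extent of the bounding box (points)
    "is_dollar_amount": 1,          # example of a hand-crafted feature
}
```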
I also believe that preparing and cleaning this data set, and bringing a challenging investigative journalism problem to the attention of other researchers, would be valuable even if I hadn’t done any work on this baseline solution. This is a problem that journalists currently expend a huge amount of time and money on, which limits the practical transparency of political ad spending.
I'm curious if you considered or attempted to use pdfplumber's table extraction methods to separate tabular from non-tabular text. That would be my starting point on a problem like this, as picking the relevant rows of a table is far easier than picking from the set of all tokens. By the way, when you say tokens, you're referring to non-whitespace characters separated by whitespace? How reliable have you found pdfplumber to be in picking out words/tokens?
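For reference, a minimal pdfplumber sketch of the kind of thing I mean (separating tabular from non-tabular text; "invoice.pdf" is just a placeholder):

```python
import pdfplumber

with pdfplumber.open("invoice.pdf") as pdf:
    page = pdf.pages[0]

    # Tabular text: each table is a list of rows, each row a list of cell strings (or None).
    tables = page.extract_tables()

    # All words on the page, each with its bounding box coordinates.
    words = page.extract_words()  # dicts with 'text', 'x0', 'x1', 'top', 'bottom'

    print(len(tables), "tables;", len(words), "words on page 1")
```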
I didn’t try separating out tables because the total field isn’t actually “inside” the table in many cases. Certainly the other fields I want are not.
pdfplumber seems mostly ok at extracting tokens. Sometimes it seems to combine tokens that should be separate. I suspect a few percent of the error is actually problems earlier in the data pipeline, as opposed to the model proper.
I think the parent's criticism was more that the article was a little light compared to the video. For example, you didn't have screenshots of the scanned PDFs in the article.
It’s not a binary classifier. Every invoice in the dataset has a total amount written somewhere on it. Accuracy here is whether the network chooses the correct token from each PDF.
> This project aimed to find out, and successfully extracted the easiest of the fields (total amount) at 90% accuracy using a relatively simple network.
Why does this page not even say what this field was? For all we know, OpenCV would have worked better.