Extracting campaign finance data from gnarly PDFs using deep learning (jonathanstray.com)
95 points by danso on June 22, 2019 | 16 comments



My personal bet is that you would have probably gotten better results by simply feeding your data into xgboost or lightgbm.

This current fad of focusing on neural networks just seems sorta silly when we have simpler and (often) better performing models. Just look at what sorts of models win kaggle competitions.


Maybe, but I think 90% of the work is wrangling the data and understanding the problem; it wouldn't be that hard to substitute gradient boosting for the NN.
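Substituting it would look roughly like this: score each candidate token with hand-crafted features and train a boosted classifier on them. (A toy sketch with invented features and made-up data, using scikit-learn's GradientBoostingClassifier as a stand-in for xgboost/lightgbm.)

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Invented per-token features: [numeric value, starts with '$',
# relative y-position on the page, appears more than once]
X = np.array([
    [500.0, 1, 0.90, 1],   # total line, near bottom of page
    [3.0,   0, 0.40, 0],   # a quantity
    [120.0, 0, 0.50, 0],   # a line item
    [500.0, 1, 0.20, 0],   # an amount in the header
    [42.0,  0, 0.60, 1],   # repeated but not the total
    [999.0, 1, 0.95, 1],   # another total line
])
y = np.array([1, 0, 0, 0, 0, 1])  # 1 = this token is the total

clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)

# Score every token on a new "page" and pick the highest-probability one
page = np.array([[10.0, 0, 0.30, 0],
                 [750.0, 1, 0.92, 1]])
probs = clf.predict_proba(page)[:, 1]
print(int(probs.argmax()))  # index of the token the model picks as the total
```

The data wrangling (OCR, tokenization, feature extraction) is identical either way; only the final scoring model changes.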


Deep learning (and indeed any kind of machine learning) is really not necessary here. I wrote a very simple baseline that estimates the total gross amount as the biggest numerical value present (favouring ones that begin with a dollar sign and/or occur more than once). This achieves 93% accuracy (code is here https://gist.github.com/rlmacsween/166b7c1c1b0c5f466a0fe9b46...) in comparison to the 90% for the deep learning model.
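Roughly, the heuristic looks like this (a simplified sketch, not the gist's actual code; the tie-break weights are arbitrary):

```python
import re
from collections import Counter

def extract_total(tokens):
    """Pick the token most likely to be the gross total: prefer the
    largest numeric value, using a leading dollar sign and repeated
    occurrence on the page as tie-breakers."""
    counts = Counter(tokens)
    best, best_score = None, float("-inf")
    for tok in tokens:
        m = re.fullmatch(r"\$?([\d,]+\.?\d*)", tok)
        if not m:
            continue  # not a numeric token
        try:
            value = float(m.group(1).replace(",", ""))
        except ValueError:
            continue
        # value dominates; the small bonuses only break ties
        score = value \
            + (0.5 if tok.startswith("$") else 0) \
            + (0.25 if counts[tok] > 1 else 0)
        if score > best_score:
            best, best_score = tok, score
    return best
```

It's striking how far such a simple prior gets you on this dataset.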


Nice! An excellent baseline, something to beat.


This is a problem that fascinates me as well: how humans can look at a document with tables, columns, etc. and effortlessly extract all of the 'factoid' bits, and often their relationships. Back at IBM there was some work on extracting tables from PDFs, but it turns out to be a pretty challenging problem to generalize.


Looks like a very shallow article that hasn't actually solved the meat of the problems. Am I reading it wrong?


Heh. It’s my work, so maybe I can clarify the goals. This is meant to be a proof of concept. I took a week and was able to show that relatively simple deep learning techniques are capable of generalizing over unseen form types with high accuracy. I also showed that tokens-plus-geometry is a viable format, and that hand-crafted feature engineering is still necessary (and still used in SOTA approaches).

I also believe that preparing and cleaning this data set, and bringing a challenging investigative journalism problem to the attention of other researchers, would be valuable even if I hadn’t done any work on this baseline solution. This is a problem that journalists currently expend a huge amount of time and money on, which reduces the effectiveness of transparency around political ad spending information.
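Concretely, the tokens-plus-geometry input can be pictured as one record per token, carrying its page position plus hand-crafted features; the model scores tokens rather than reading raw pixels. (Illustrative field names and values, not the exact schema.)

```python
import math

# One token from a hypothetical filing, as the model would see it
token_record = {
    "text": "$500.00",
    "page": 1,
    "x0": 412.0, "y0": 702.5,   # bounding box, in PDF points
    "x1": 455.0, "y1": 714.0,
    "features": {
        "is_dollar": True,                   # starts with '$'
        "log_value": math.log(500.0),        # log of the parsed value
        "matches_total_regex": False,        # near a word like "TOTAL"?
    },
}
print(token_record["text"], round(token_record["features"]["log_value"], 2))
```

The network's task is then to select, from all such records on a page, the one holding the total.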


I'm curious if you considered or attempted to use pdfplumber's table extraction methods to separate tabular from non-tabular text. That would be my starting point on a problem like this, as picking the relevant rows of a table is far easier than picking from the set of all tokens. By the way, when you say tokens, you're referring to non-whitespace characters separated by whitespace? How reliable have you found pdfplumber to be in picking out words/tokens?


I didn’t try separating out tables because the total field isn’t actually “inside” the table in many cases. Certainly the other fields I want are not.

pdfplumber seems mostly ok at extracting tokens. Sometimes it seems to combine tokens that should be separate. I suspect a few percent of the error is actually problems earlier in the data pipeline, as opposed to the model proper.
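By tokens I mean roughly what you'd get from whitespace splitting, with positions attached. (A simplified illustration; pdfplumber's extract_words actually groups individual characters by their page coordinates, not by string offsets.)

```python
import re

def tokenize(line):
    """Split a line into (token, start, end) triples, where a token is
    a maximal run of non-whitespace characters."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\S+", line)]

print(tokenize("TOTAL DUE:  $500.00"))
```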


Really enjoyed the video, thank you for sharing.

I think the parent's criticism was more that the article was a little light compared to the video. For example, you didn't have screenshots of the scanned PDFs in the article.


Accuracy may not be the best metric if your dataset is imbalanced.


It’s not a binary classifier. Every invoice in the dataset has a total amount written somewhere on it. Accuracy here is whether the network chooses the correct token from each PDF.


> This project aimed to find out, and successfully extracted the easiest of the fields (total amount) at 90% accuracy using a relatively simple network.

Why does this page not even say what this field was? For all we know, OpenCV would have worked better.


As it states, the field in question is "Total amount".


How does the performance compare to a simple OCR reader?


It operates on OCR output. Its job is to identify the total expenditure by looking at the OCR output.




