My personal bet is that you would have probably gotten better results by simply feeding your data into xgboost or lightgbm.
This current fad of focusing on neural networks just seems sorta silly when we have simpler and (often) better performing models. Just look at what sorts of models win kaggle competitions.
Maybe, but I think 90% of the work is wrangling the data and understanding the problem; it wouldn't be that hard to substitute gradient boosting for the NN.
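For what it's worth, a rough sketch of what that substitution might look like, assuming you've already turned each candidate token into a small feature vector (the feature names and toy data here are made up, not from the project):

```python
# Hypothetical sketch: score candidate tokens with gradient boosting instead of a NN.
import lightgbm as lgb
import numpy as np

# X: one row per candidate token, e.g. [looks_like_money, x_position, y_position, token_length]
# y: 1 if the token is the true total amount, 0 otherwise
X_train = np.array([[1, 0.8, 0.9, 7],
                    [0, 0.1, 0.2, 4],
                    [1, 0.7, 0.95, 6]])
y_train = np.array([1, 0, 0])

model = lgb.LGBMClassifier(n_estimators=50, min_child_samples=1)
model.fit(X_train, y_train)

# At prediction time, pick the highest-scoring token per document.
scores = model.predict_proba(X_train)[:, 1]
best_token_index = int(np.argmax(scores))
```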
Deep learning (and indeed any kind of machine learning) is really not necessary here. I wrote a very simple baseline that estimates the total gross amount as the biggest numerical value present (favouring ones that begin with a dollar sign and/or occur more than once). This achieves 93% accuracy (code is here https://gist.github.com/rlmacsween/166b7c1c1b0c5f466a0fe9b46...) in comparison to the 90% for the deep learning model.
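The gist has the details, but the heuristic is roughly this (a loose paraphrase, not the gist's actual code):

```python
import re
from collections import Counter

def guess_total(tokens):
    """Guess the total gross amount: the largest numeric token, favouring
    ones with a leading '$' and ones that appear more than once."""
    counts = Counter(tokens)
    candidates = []
    for tok in set(tokens):
        m = re.fullmatch(r"(\$?)([\d,]+(?:\.\d{1,2})?)", tok)
        if not m:
            continue
        try:
            value = float(m.group(2).replace(",", ""))
        except ValueError:
            continue
        has_dollar = m.group(1) == "$"
        repeated = counts[tok] > 1
        # Dollar sign and repetition act as bonuses, then largest value wins.
        candidates.append(((has_dollar, repeated, value), value))
    if not candidates:
        return None
    return max(candidates)[1]
```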
This is a problem that fascinates me as well: how humans can look at a document with tables, columns, etc. and effortlessly extract all of the 'factoid' bits, and often their relationships. Back at IBM there had been some work on extracting tables from PDFs, but it turns out to be a pretty challenging problem to generalize.
Heh. It’s my work, so maybe I can clarify the goals. This is meant to be a proof of concept. I took a week and was able to show that relatively simple deep learning techniques are capable of generalizing over unseen form types with high accuracy. I also showed that tokens-plus-geometry is a viable format, and that hand-crafted feature engineering is still necessary (and still used in SOTA approaches).
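For anyone wondering what "tokens-plus-geometry" means in practice: the input is roughly one record per token, something like the following (field names here are illustrative, not necessarily the ones in the repo):

```python
# Illustrative token-plus-geometry record; field names are my own, not the project's.
token_record = {
    "text": "$10,250.00",           # the token string as extracted from the PDF
    "page": 0,                      # page index
    "x0": 412.3, "x1": 471.8,       # horizontal extent of the bounding box (points)
    "top": 655.0, "bottom": 667.2,  # vertical extent of the bounding box (points)
    "is_dollar_amount": 1,          # example of a hand-crafted feature
}
```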
I also believe that preparing and cleaning this data set, and bringing a challenging investigative journalism problem to the attention of other researchers, would be valuable even if I hadn’t done any work on this baseline solution. This is a problem that journalists currently expend a huge amount of time and money on, which limits the practical transparency of political ad spending.
I'm curious if you considered or attempted to use pdfplumber's table extraction methods to separate tabular from non-tabular text. That would be my starting point on a problem like this, as picking the relevant rows of a table is far easier than picking from the set of all tokens. By the way, when you say tokens, you're referring to non-whitespace characters separated by whitespace? How reliable have you found pdfplumber to be in picking out words/tokens?
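For reference, a minimal pdfplumber sketch of the kind of thing I mean (separating tabular from non-tabular text; "invoice.pdf" is just a placeholder):

```python
import pdfplumber

with pdfplumber.open("invoice.pdf") as pdf:
    page = pdf.pages[0]

    # Tabular text: each table is a list of rows, each row a list of cell strings (or None).
    tables = page.extract_tables()

    # All words on the page, each with its bounding box coordinates.
    words = page.extract_words()  # dicts with 'text', 'x0', 'x1', 'top', 'bottom'

    print(len(tables), "tables;", len(words), "words on page 1")
```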
I didn’t try separating out tables because the total field isn’t actually “inside” the table in many cases. Certainly the other fields I want are not.
pdfplumber seems mostly ok at extracting tokens. Sometimes it seems to combine tokens that should be separate. I suspect a few percent of the error is actually problems earlier in the data pipeline, as opposed to the model proper.
I think the parent's criticism was more that the article was a little light compared to the video. For example, you didn't have screenshots of the scanned PDFs in the article.
It’s not a binary classifier. Every invoice in the dataset has a total amount written somewhere on it. Accuracy here is whether the network chooses the correct token from each PDF.
> This project aimed to find out, and successfully extracted the easiest of the fields (total amount) at 90% accuracy using a relatively simple network.
Why does this page not even say what this field was? For all we know, OpenCV would have worked better.