
Table Detection and Extraction Using Deep Learning - ole_gooner
https://nanonets.com/blog/table-extraction-deep-learning/
======
willvarfar
I've worked with several companies that try to parse things in PDF documents,
extracting tables and paragraphs, etc. This is genuinely challenging because a
PDF is a large bag of words and word fragments with x/y positions. There is a
particularly popular word processor that emits individual characters. Just
determining that two fragments are part of the same word is challenging, as is
detecting bullet points, etc.

The AI approaches are definitely still worse than human-written rules. I can
infer - and I've chatted with the devs to confirm - from the quality of the
text and table extraction whether the company is using a modern NN approach or
someone has sat down and handwritten some simple rules that understand indents
and baselines etc.
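To make the "simple rules" idea concrete, here is a toy sketch of merging
character fragments into words by shared baseline and small horizontal gap.
The fragment tuples, field order, and thresholds are all invented for
illustration; real PDFs need per-font metrics:

```python
# Toy sketch: merge PDF text fragments into words using baselines and gaps.
# A fragment is (text, x0, x1, baseline_y); thresholds are illustrative guesses.

def merge_fragments(fragments, max_gap=1.0, baseline_tol=0.5):
    """Group fragments that sit on the same baseline and nearly touch."""
    # Sort by baseline, then left edge, so neighbours end up adjacent.
    frags = sorted(fragments, key=lambda f: (round(f[3], 1), f[1]))
    words, current = [], None
    for text, x0, x1, y in frags:
        if (current is not None
                and abs(y - current[3]) <= baseline_tol   # same baseline
                and x0 - current[2] <= max_gap):          # tiny horizontal gap
            current = (current[0] + text, current[1], x1, y)
        else:
            if current:
                words.append(current[0])
            current = (text, x0, x1, y)
    if current:
        words.append(current[0])
    return words

frags = [("Ta", 10, 20, 100.0), ("ble", 20.5, 35, 100.0),
         ("data", 50, 70, 100.0), ("next", 10, 30, 120.0)]
print(merge_fragments(frags))  # ['Table', 'data', 'next']
```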

~~~
UglyToad
I had to check we hadn't worked for the same company! Yeah, text extraction
and layout analysis from PDFs is a super interesting challenge and still
relatively underdeveloped. I'd put table detection at about the hardest
challenge in that field.

One of the contributors to the PDF library I'm developing has been
implementing some interesting algorithms for layout analysis
[https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Analysis](https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Analysis)

~~~
ahpearce
Referring to the above poster's "non-locality", are we talking about
denormalization of formatting? Is there a way to "normalize" PDF structure?
Calculate margins or common formats beforehand to normalize?

~~~
fallous
I believe the reference is to logical locality, specifically in the case of
PDF that transforms and such are essentially atomic and there's no real
boundary layer in which you may say "transform X and transform Y are
equivalent within this local finite domain."

There is no real differentiation between formatting and content in a PDF, so
it's not possible to truly separate them.

------
chriskanan
With collaborators at Adobe Research, my lab published a paper recently
showing how to do table reconstruction from infographics (e.g., bar charts)
using deep learning [1].

While it isn't the sexiest project, I've had a number of companies reach out
about it. Human-written rule-based approaches are pretty bad at the task, and
even humans doing it manually aren't great (likely due to sloppiness).

[1] [https://arxiv.org/abs/1908.01801](https://arxiv.org/abs/1908.01801)

~~~
jessaustin
I've found that when PDFs are produced by a single entity for a particular
purpose, I can automate this pretty well with a loop and some regex... maybe
I've just gotten lucky?
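Not just luck — when one producer emits a predictable layout, a per-line regex
really can carry you a long way. A minimal sketch, with an invoice-like line
format invented purely for illustration:

```python
import re

# Toy sketch: when a single producer emits predictable text, one regex per
# row type is often enough. This line format is invented for illustration.
ROW = re.compile(r"^(?P<item>[\w ]+?)\s{2,}(?P<qty>\d+)\s{2,}\$(?P<price>[\d.]+)$")

text = """Widget A    3   $9.99
Gadget B    12  $1.50"""

rows = []
for line in text.splitlines():
    m = ROW.match(line.strip())
    if m:
        rows.append((m["item"], int(m["qty"]), float(m["price"])))

print(rows)  # [('Widget A', 3, 9.99), ('Gadget B', 12, 1.5)]
```

The approach breaks as soon as a second producer (or a new template version)
shows up, which is exactly where the layout-analysis approaches above come in.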

------
theSage
For what it's worth, at my previous place we built a YOLO-based model for
detecting paragraphs/tables/headlines/page layouts, mixed with traditional
rule-based OCR/layout detection.

[https://www.youtube.com/watch?v=VVdHFqhQRUk](https://www.youtube.com/watch?v=VVdHFqhQRUk)

[https://voody.clapresearch.com/](https://voody.clapresearch.com/)

------
bondolo
This capability has a lot of value for accessibility. Recovering the table
structure for logical presentation allows navigation by blind users as well as
users who are not using a pointing device.

It is disappointing just how haphazardly most PDFs are structured. Too many of
the PDF production tools remove all document structure metadata or fail to
include it by default.

------
jjohansson
Disclaimer: I work for PDFTron

This is a very interesting field, and PDFTron has been doing similar work with
ML and table extraction as part of our document understanding platform. We've
made pretty good progress over the past year -- you can try it on your docs
here:

[https://www.pdftron.com/pdf-tools/pdf-table-extraction/](https://www.pdftron.com/pdf-tools/pdf-table-extraction/)

We also have a rules-based table extraction product (PDFGenie) that works
reasonably well, but ML is most definitely the future.

------
Pandabob
I applaud the Nanonets folks for starting a business around AI APIs, and it
seems clear there's lots of value to unlock with solutions like these.

But does anyone have insight into how hard it is to be in a space where all
the big cloud providers seem to offer very similar products? Can you survive
by focusing on a niche segment? Is the market growing so fast that there's
room for multiple companies offering (roughly) the same thing?

~~~
ackbar03
I think it's pretty hard, and even the big cloud providers don't necessarily
have a perfect solution. It's not a particularly creative idea to come up
with, I think. I've thought about making something similar as a product, but
I'm kinda glad I didn't.

------
nanoamp
I can see the use case and potential for ML in extracting tables, but I'd be
worried about the potential for decision-making mistakes in environments the
author identifies, such as finance.

The example of TableNet using deep learning for table extraction on top of
tesseract for OCR means two layers of ML, either of which could individually
introduce pathologies without human oversight. It reminds me of the
photocopier that changed numbers for you -
[https://www.theregister.co.uk/2013/08/06/xerox_copier_flaw_means_dodgy_numbers_and_dangerous_designs/](https://www.theregister.co.uk/2013/08/06/xerox_copier_flaw_means_dodgy_numbers_and_dangerous_designs/)

If an ML engine was trained to be able to do things like look for totals and
sub-totals in numerical tables and flag errors in summation, then that would
clearly add more value in parsing for moderation (the use-case described at
the end). But that doesn't seem to be something that's yet... on the table.
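The totals-checking part, at least, needs no ML once a table is extracted; a
minimal sketch, assuming a hypothetical (label, value) row layout where
"Total" rows close out a section:

```python
# Toy sketch: flag rows labelled "Total" whose value doesn't match the sum
# of the rows above them. The (label, value) row layout is an assumption.

def check_totals(rows, tol=0.01):
    """Return labels of total rows that disagree with their running sum."""
    errors, running = [], 0.0
    for label, value in rows:
        if label.lower().startswith("total"):
            if abs(value - running) > tol:   # possible OCR/extraction slip
                errors.append(label)
            running = 0.0                    # a total starts a new section
        else:
            running += value
    return errors

table = [("Fees", 120.00), ("Interest", 30.50), ("Total Q1", 150.50),
         ("Fees", 100.00), ("Interest", 25.00), ("Total Q2", 999.00)]
print(check_totals(table))  # ['Total Q2']
```

A flagged row could then be routed to a human, which is the moderation
use case described in the article.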

~~~
fny
There's a project from Microsoft Research that's really interesting which does
just that:

[https://www.microsoft.com/en-us/research/publication/melford-using-neural-networks-find-spreadsheet-errors/](https://www.microsoft.com/en-us/research/publication/melford-using-neural-networks-find-spreadsheet-errors/)

~~~
nanoamp
It looks like it's not quite the same thing, in that it identifies Excel
values that should be formulae. It could be used in a pipeline with
spreadsheets extracted by ML/OCR to reverse-engineer formulae though, which is
an interesting prospect.

------
lowdose
There is a lot of data locked up in PDFs, but even more so in images.

I would like to have a neural net that can give me the data from a chart in
an image. I have a hunch image-segmentation NNs are able to do this, since
the size of a chart element's surface (e.g., a bar's area) is predictive of
the value it encodes. Artificial training data could be created at scale with
the Google Sheets API.
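The surface-size intuition can be sketched without any NN at all: once a bar
chart is segmented into foreground/background pixels, each bar's pixel extent
is proportional to its value. A toy sketch on a synthetic 0/1 "image":

```python
# Toy sketch of the intuition: in a segmented bar chart, a bar's pixel
# extent encodes its value. The "image" is a synthetic grid of 0/1 pixels
# standing in for a real segmentation mask.

def bar_heights(img):
    """Collapse column-wise filled-pixel counts into one height per bar."""
    ncols = len(img[0])
    col_counts = [sum(row[c] for row in img) for c in range(ncols)]
    # Group contiguous non-empty columns into bars; take each bar's height.
    bars, current = [], []
    for count in col_counts + [0]:   # trailing 0 flushes the last bar
        if count > 0:
            current.append(count)
        elif current:
            bars.append(max(current))
            current = []
    return bars

# Two bars of heights 3 and 1 on a 4x5 canvas (row 0 is the top).
img = [[0, 0, 0, 0, 0],
       [1, 1, 0, 0, 0],
       [1, 1, 0, 0, 0],
       [1, 1, 0, 1, 1]]
print(bar_heights(img))  # [3, 1]
```

The hard part a segmentation NN would actually solve is producing a clean
mask from a noisy photo or screenshot, plus reading the axis scale.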

------
ivan_ah
Also in the extracting-structured-data-from-PDFs solution space, there is
Parsr, which was recently posted on HN:
[https://github.com/axa-group/Parsr](https://github.com/axa-group/Parsr) (see
[https://news.ycombinator.com/item?id=22035258](https://news.ycombinator.com/item?id=22035258)).
It's based on a pipeline of various JS modules and pluggable backends (e.g.,
tesseract, GCP Cloud Vision, the Abbyy API, etc.)

For tables with numbers in them, it worked pretty well, but I've yet to find
a tool that can parse/understand documents where the entire page is a table
layout with lots of merged cells. I think even for humans it's hard to
understand the structure in those cases...

------
yorwba
Related: table extraction using mixed integer programming to encode
constraints:
[https://news.ycombinator.com/item?id=21256005](https://news.ycombinator.com/item?id=21256005)

------
cafard
I was very impressed with "Camelot"
([https://camelot-py.readthedocs.io/en/master/](https://camelot-py.readthedocs.io/en/master/)).
My impression was that it extracted maybe 80 or 90% of the text properly, far
better than anything else I had tried.

------
tastyminerals
Table detection is useful for line item extraction from financial documents
and it is solvable. However, generic table extraction is very difficult.

------
busymom0
Partially related - is this what someone could use to detect a sudoku grid?
The spaces and the digits from a picture?

~~~
sarthakjain
Sudoku is a relatively simple problem, since the structure is known a priori;
it becomes as simple as pattern matching.

~~~
ovi256
Exactly, sudoku can be solved with classical CV through OpenCV, see for
example
[https://www.youtube.com/watch?v=QR66rMS_ZfA](https://www.youtube.com/watch?v=QR66rMS_ZfA)

He's using a CNN for digit recognition.
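The classical-CV grid-finding step can be sketched even without OpenCV: in a
binarised sudoku image, grid lines show up as rows and columns that are
almost entirely dark, so simple projections locate them. A toy sketch on a
synthetic mask (a real photo would first need thresholding and deskewing):

```python
# Toy sketch of the classical-CV idea: in a binarised sudoku image, grid
# lines appear as rows/columns that are mostly filled. Projections find
# them with no learning. The tiny synthetic image stands in for a photo.

def grid_lines(img, frac=0.9):
    """Indices of rows and columns whose filled-pixel fraction exceeds frac."""
    h, w = len(img), len(img[0])
    rows = [r for r in range(h) if sum(img[r]) >= frac * w]
    cols = [c for c in range(w)
            if sum(img[r][c] for r in range(h)) >= frac * h]
    return rows, cols

# 7x7 image with grid lines at rows/cols 0, 3, 6 (a 2x2-cell grid).
img = [[1 if r in (0, 3, 6) or c in (0, 3, 6) else 0
        for c in range(7)] for r in range(7)]
print(grid_lines(img))  # ([0, 3, 6], [0, 3, 6])
```

Consecutive line positions then bound the cells, and each cell crop goes to
the digit recogniser (the CNN step in the linked video).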

------
mushufasa
see camelot
[https://camelot-py.readthedocs.io/en/master/](https://camelot-py.readthedocs.io/en/master/)

------
PaulHoule
Woohoo!

------
Animats
Table extraction has been a feature of better OCR programs for at least a
decade. It's easier than the OCR part. Look up "OCR table" for examples,
products, code, papers, etc.

~~~
m1sta_
You're wrong. Robust and easy-to-use table extraction might be solvable, but
from a business perspective it isn't solved.

~~~
saradhi
Did you try [https://extracttable.com](https://extracttable.com)?

The mentioned service is not perfect either. There are always limitations;
minimizing them is the key.

P.S.: I work with the team at extracttable

