
Probabilistic Scraping of Plain Text Tables (2013) - polm23
https://edinburghhacklab.com/2013/09/probabalistic-scraping-of-plain-text-tables/
======
tlarkworthy
Oh, a blast from my past! That was some super fun work using mixed integer
programming to fuse hard global knowledge (table structure) with soft
information (OCR predictions). It worked pretty well. My only regret is
spelling the title wrong; it hurts me every time I read it.

~~~
stereolambda
This is very, very neat and doesn't require excessive setup, which would be
needed with practically any kind of ML procedure (manual annotation, probably
defeating the purpose; handling structure). Fitting around 1000 booleans is
indeed not much of a problem.

Would you mind sharing how many rows there were? I know it's probably possible to
reverse-engineer it from other information, but I think it would be
illustrative.

Also, this fragment is interesting:

 _> Whilst integer programming is NP-hard to solve in general, these problem
instances are not pathological instances_

 _Could_ this kind of tabular data have turned out to be pathological? Would
it mean that the constraints cannot be met and we would have to search the whole space to
ascertain that? I imagine these general solvers don't do specific heuristics
when searching.

~~~
tlarkworthy
Depends on the table. The digikey catalog is thousands of small tables, but
most are around 20 rows. It's worth noting that some header rows are
multi-line, which is part of the difficulty: the MIP encodes where in the
ASCII representation the header ends and the data starts.

> these problem instances are not pathological instances

With MIP, the solver churns when a lot of information suggests that one path
is likely to be optimal but the optimum is actually another path, and you
have been misled by red herrings. One cause would be a subset of weak
classifiers being configured incorrectly, so that they are actively misleading.
Another cause might be that the table structure is very ambiguous and requires
a lot of global deductive reasoning to figure it out.

However, given digikey is not maliciously trying to create difficult-to-read
tables, I don't think these cases really come up. Misleading weak classifiers
would be a bug, though it's sometimes hard to spot if enough of the system
still reaches the correct decision. There is probably some math to assign
utility to certain classifiers, though I never looked at it beyond case-by-case
debugging of decision samples.

Also worth noting this never went into full production due to the surrounding
business being broken for completely non-technical reasons. But I do think the
MIP + MLE is a useful technique for a few different forms of problems where
you want to integrate hard ontological constraints over a fuzzy reasoning
system.
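
To make the "hard constraints over soft predictions" idea concrete, here is a
minimal sketch, not the code from the article. It assumes each line of a table
gets an independent probability of being a header row (the `p_header` numbers
are made up), and that the only hard constraint is "one or more header lines,
then only data lines". To stay dependency-free it swaps the MIP solver for
brute-force enumeration, which is only viable for tiny tables; the helper names
are hypothetical.

```python
import itertools
import math

# Hypothetical per-line classifier outputs: P(line is a header row).
# These values are invented for illustration only.
p_header = [0.9, 0.45, 0.7, 0.1, 0.2]


def log_likelihood(labels, probs):
    """Log-likelihood of a labelling, treating per-line predictions as independent."""
    return sum(math.log(p if is_header else 1.0 - p)
               for is_header, p in zip(labels, probs))


def is_valid(labels):
    """Hard constraint: header lines first (at least one), then data lines (at least one)."""
    if not labels[0] or labels[-1]:
        return False
    seen_data = False
    for is_header in labels:
        if not is_header:
            seen_data = True
        elif seen_data:
            # A header line after a data line breaks the table structure.
            return False
    return True


def best_labelling(probs):
    """Most likely labelling among those satisfying the hard constraint.

    Enumerates all 2**n labellings; a MIP solver would replace this search.
    """
    candidates = itertools.product([True, False], repeat=len(probs))
    valid = [c for c in candidates if is_valid(c)]
    return max(valid, key=lambda c: log_likelihood(c, probs))


# Thresholding each line on its own ignores the table structure.
naive = tuple(p > 0.5 for p in p_header)
best = best_labelling(p_header)

print("naive:", naive, "valid:", is_valid(naive))
print("best: ", best, "valid:", is_valid(best))
```

With these numbers, thresholding each line independently marks line 2 as data
sandwiched between two header lines, which violates the structure. The
constrained search instead labels the first three lines as header, overriding
the weak vote on line 2 because that is the most likely labelling that is still
a legal table.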

------
dang
Discussed at the time:
[https://news.ycombinator.com/item?id=6334178](https://news.ycombinator.com/item?id=6334178)

------
JadeNB
I'm not sure if mirroring typos in the linked article is the intention, but of
course it should be 'probabilistic', not 'probabalistic', even though the
linked article has it the latter way.

~~~
dang
Fixed above now. Thanks!

