
Probabilistic Scraping of Plain Text Tables - tlarkworthy
http://edinburghhacklab.com/2013/09/probabalistic-scraping-of-plain-text-tables/
======
danso
Where is a HN mod when you need one? The original title of the OP is much more
accurate and much more interesting:

"Probabalistic Scraping of Plain Text Tables"

vs the submitted title, "Accurate, efficient and maintainable scrapers"

As someone who's scraped a lot of plain text government data, I've given up on
finding a flexible machine-learning way and just hand-coded for specific
use cases (because government software changes so rarely). However, the OP
presents an interesting strategy (with math that I haven't quite mastered,
unfortunately).

~~~
tlarkworthy
It's not so hard, once you understand how to force boolean algebra into
integer form. For example:

x + y >= 1, where x, y = {0, 1}

is a boolean OR between x and y, and it is easy to express in PuLP:

x = LpVariable("x", lowBound=0, upBound=1, cat='Integer')

y = LpVariable("y", lowBound=0, upBound=1, cat='Integer')

problem += x + y >= 1

Tada, it's really easy to write (once the math is defined; the modelling bit
is the hard part, but it gets easier over time).
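To sanity-check that encoding without a solver (this is my own illustration in plain Python; only the constraint itself comes from the comment above), you can enumerate every {0, 1} assignment and compare against the boolean OR:

```python
from itertools import product

# Boolean OR forced into integer form: the linear constraint x + y >= 1
# over x, y in {0, 1} is satisfied exactly when (x OR y) is true.
truth_table = [(x, y, x + y >= 1) for x, y in product((0, 1), repeat=2)]

for x, y, satisfied in truth_table:
    assert satisfied == bool(x or y)

# AND and NOT linearise the same way: x + y >= 2 and 1 - x respectively.
for x, y in product((0, 1), repeat=2):
    assert (x + y >= 2) == bool(x and y)
    assert (1 - x) == (not x)
```

In PuLP itself you just add the same constraint to the problem with `problem += x + y >= 1`, and the solver searches over these integer assignments for you.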

------
mrcactu5
Statistical parsing (http://nlp.stanford.edu/courses/lsa354/) is often used in
Natural Language Processing, where one can't decide how to arrange sentences
into trees, so one is picked at random among acceptable choices.

I've gotten very frustrated writing rules and regexes and exceptions. This
method seems really great when scraping data which is structured but a little
bit noisy.

~~~
tlarkworthy
I got sick of writing rules too, hence the diversion!

Yeah, so you could extend those tree decision processes with this. You could
add a cost function on the trees which includes long-range communication
between nodes (e.g. the results section of a document includes at least one
word commonly associated with results). So you could go to higher levels of
abstraction.

However, I'd say limiting yourself to even a very abstract parse tree
structure does kinda exclude the spatial layout of words on a page and the
semantic meaning behind the layout (tables, web pages, code formatting!).

MIPs can fully express tree logic, so you lose nothing.
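As a toy instance of that claim (my own sketch, not code from the post): a tree rule like "if a node is labelled results, at least one of its words must be a results keyword" is the implication z -> (a OR b), which linearises as a + b >= z over binary variables. Enumeration confirms the encoding:

```python
from itertools import product

# z -> (a OR b) as the linear constraint a + b >= z, all variables in {0, 1}.
# z might indicate "this node is the results section"; a, b might indicate
# "word i is a results keyword" (hypothetical labels, for illustration only).
def implication(z, a, b):
    return bool((not z) or a or b)

violations = [
    (z, a, b)
    for z, a, b in product((0, 1), repeat=3)
    if (a + b >= z) != implication(z, a, b)
]

# Empty: the linear constraint agrees with the implication on all 8 assignments.
assert violations == []
```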

~~~
mrcactu5
Looks like you were trying to complete data extracted with Optical Character
Recognition by maximum entropy.

This is awesome. What are some other use cases for this?

With APIs, much of the data I use has already been formatted as JSON or CSV,
so I don't get to scrape much anymore.

------
mjw
On the subject of scraping data from OCR'd tables:

I heard from a colleague who moved into finance that there's a mini arms race
going on between some funds(?) that are subject to regulatory requirements to
release financial performance metrics but would, for a variety of reasons,
rather not (and certainly would rather not make the data machine-readable),
and other hedge funds that want to run automated trading strategies off said
released figures.

They keep obfuscating the tables to make them harder and harder to parse
algorithmically while still remaining theoretically human-readable.

Great fun no doubt for everyone involved.

~~~
volokoumphetico
any resources to back this up?

~~~
mjw
Entirely anecdotal.

------
joshbradshaw
I have an imminent use for this algorithm at the start-up I'm interning at,
but unfortunately the concepts are beyond my understanding. Could anyone
please recommend resources that would help me understand and apply this
method?

------
whosbacon
I remember attending a thesis defense while I was attending Edinburgh Uni. Is
that you?

~~~
tlarkworthy
Um, you're in the right geographic area, but there were no strangers at my
thesis defence... so unless you are my supervisor, external examiner, or
internal examiner, it's unlikely you were there (sidenote: my PhD was in
robotics, not this kind of stuff).

