Probabilistic Scraping of Plain Text Tables (edinburghhacklab.com)
92 points by tlarkworthy on Sept 5, 2013 | hide | past | favorite | 12 comments



Where is a HN mod when you need one? The original title of the OP is much more accurate and much more interesting:

"Probabilistic Scraping of Plain Text Tables"

vs the submitted title, "Accurate, efficient and maintainable scrapers"

As someone who's scraped a lot of plain text government data, I've given up on finding a flexible machine-learning way and just hand-coded for specific use cases (because government software changes so rarely). However, the OP presents an interesting strategy (with math that I haven't quite mastered, unfortunately).


It's not so hard, if you understand how to force boolean algebra into integer form. For example:

x + y >= 1, where x, y ∈ {0, 1}

is a boolean OR between x and y, and it is easy to express in PuLP:

x = LpVariable("x", lowBound=0, upBound=1, cat="Integer")

y = LpVariable("y", lowBound=0, upBound=1, cat="Integer")

problem += x + y >= 1

Tada, it's really easy to write (once the math is defined; the modelling bit is the hard part, but it gets easier over time).
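To see that this kind of encoding really captures boolean logic, you can brute-force every 0/1 assignment and compare each linear constraint against the boolean connective it is meant to encode. A standalone sketch, no solver needed; the AND/NOT/IMPLIES encodings are the standard ones, added here for illustration:

```python
from itertools import product

# Boolean connectives as linear constraints over 0/1 integer variables:
#   OR:       x + y >= 1
#   AND:      x + y >= 2
#   NOT x:    1 - x >= 1
#   x => y:   x <= y   (x = 1 forces y = 1)
for x, y in product((0, 1), repeat=2):
    assert (x + y >= 1) == bool(x or y)    # OR
    assert (x + y >= 2) == bool(x and y)   # AND
    assert (1 - x >= 1) == (not x)         # NOT
    assert (x <= y) == ((not x) or y)      # IMPLIES
print("all encodings agree with the boolean connectives")
```

The same trick composes: any boolean formula over 0/1 variables can be rewritten as a set of linear constraints, which is what lets a MIP solver search over logical structures.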


[Statistical parsing](http://nlp.stanford.edu/courses/lsa354/) is often used in Natural Language Processing, where a parser can't decide how to arrange a sentence into a tree, so it picks the most probable one among the acceptable choices.

I've gotten very frustrated writing rules and regexes and exceptions. This method seems really great when scraping data which is structured but a little bit noisy.


I got sick of writing rules too, hence the diversion!

Yeah, so you could extend tree decision processes this way. You could add a cost function over the trees which includes long-range communication between nodes (e.g. the results section of a document includes at least one word commonly associated with results). So you could go to higher levels of abstraction.

However, I'd say limiting yourself to even a very abstract parse-tree structure does kind of exclude the spatial layout of words on a page and the semantic meaning behind the layout (tables, web pages, code formatting!).

MIPs can fully express tree logic, so you lose nothing.
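As a small illustration of "MIPs can express tree logic": a rule like "if a parent node is selected, at least one of its children must be selected" becomes the linear constraint p <= c1 + c2 over 0/1 variables. The rule and the variable names here are made up for illustration; the check is brute force rather than a solver:

```python
from itertools import product

def boolean_rule(p, c1, c2):
    # "parent selected implies some child selected", as boolean logic
    return (not p) or (c1 or c2)

def linear_rule(p, c1, c2):
    # the same rule as a linear constraint over 0/1 variables
    return p <= c1 + c2

# The two formulations agree on every 0/1 assignment.
for p, c1, c2 in product((0, 1), repeat=3):
    assert bool(boolean_rule(p, c1, c2)) == linear_rule(p, c1, c2)
print("tree rule and linear constraint agree on all assignments")
```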


Looks like you were trying to complete data extracted by Optical Character Recognition using maximum entropy.

This is awesome. What are some other use cases for this?

With APIs much of the data I use has already been formatted in JSON or CSV already, so I don't get to scrape much anymore.


On the subject of scraping data from OCR'd tables:

I heard from a colleague who moved into finance that there's a mini arms race going on between some funds(?) that are subject to regulatory requirements to release financial performance metrics but, for a variety of reasons, would rather not (and certainly would rather not make the data machine-readable), and other hedge funds that want to run automated trading strategies off said released figures.

They keep obfuscating the tables to make them harder and harder to parse algorithmically while still remaining theoretically human-readable.

Great fun no doubt for everyone involved.


The regulations should require that the disclosures be provided in both human- and machine-readable formats.


Any resources to back this up?


Entirely anecdotal.


I have imminent use for this algorithm at the start-up I am interning at, but unfortunately the concepts are beyond my understanding. Could anyone please recommend resources that will help me understand and apply this method?


I remember attending a thesis defense while I was attending Edinburgh Uni. Is that you?


Um, you are in the right geographic area, but there were no strangers at my thesis defence... so unless you are my supervisor or my external or internal examiner, it's unlikely you were there (sidenote: my PhD was in robotics, not this kind of stuff).



