Where is an HN mod when you need one? The original title of the OP is much more accurate and much more interesting:
"Probabilistic Scraping of Plain Text Tables"
vs the submitted title, "Accurate, efficient and maintainable scrapers"
As someone who's scraped a lot of plain text government data, I've given up on finding a flexible machine-learning approach and just hand-coded for specific use cases (because government software changes so rarely). However, the OP presents an interesting strategy (with math that I haven't quite mastered, unfortunately).
[Statistical parsing](http://nlp.stanford.edu/courses/lsa354/) is often used in Natural Language Processing, where you can't always decide how to arrange a sentence into a tree, so a parse is chosen according to its probability among the acceptable ones.
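To make that concrete, here's a toy sketch of that selection step (this is not the Stanford parser; the candidate parses and rule probabilities are invented), using the common decision rule of keeping the most probable acceptable parse:

```python
# Toy illustration: several grammatically acceptable parses for an ambiguous
# sentence, each built from rules with (made-up) probabilities. We score each
# parse by its log-probability and keep the best one.
import math

candidate_parses = {
    # parse description -> list of (rule, probability) used to build it
    "PP attaches to verb": [("VP -> V NP PP", 0.3), ("NP -> Det N", 0.7)],
    "PP attaches to noun": [("VP -> V NP", 0.5), ("NP -> NP PP", 0.2)],
}

def log_prob(rules):
    # Joint probability of a parse = product of its rule probabilities.
    return sum(math.log(p) for _, p in rules)

best = max(candidate_parses, key=lambda k: log_prob(candidate_parses[k]))
print(best)  # the highest-probability parse wins
```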
I've gotten very frustrated writing rules and regexes and exceptions. This method seems really great when scraping data which is structured but a little bit noisy.
I got sick of writing rules too, hence the diversion!
Yeah, so you could extend that tree decision process to this. You could add a cost function on the trees that includes long-range communication between nodes (e.g. the results section of a document includes at least one word commonly associated with results). So you could go to higher levels of abstraction.
However, I'd say limiting yourself to even a very abstract parse tree structure does kinda exclude the spatial layout of words on a page and the semantic meaning behind the layout (tables, web pages, code formatting!).
MIPs (mixed integer programs) can fully express tree logic, so you lose nothing.
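For anyone wondering what that looks like in practice, here is a minimal sketch of encoding one table row as a MIP. It is not the post's actual model: it assumes PuLP as the solver interface, and the tokens and per-label log-probabilities are invented. The point is that the objective maximises the joint log-probability of a labelling, and structural rules (each token gets one label, each column appears once) become linear constraints; longer-range, tree-style rules like the one above can be added the same way, as extra linear constraints over binary indicator variables.

```python
# Sketch: assign each token on a row to a column label by maximising the
# joint log-probability, subject to structural constraints. Assumes PuLP;
# the tokens and scores below are made up for illustration.
import math
import pulp

tokens = ["2013-08-01", "ACME Corp", "1,200.00"]
labels = ["date", "description", "amount"]

# Hypothetical classifier scores: log P(label | token).
log_p = {
    ("2013-08-01", "date"): math.log(0.90),
    ("2013-08-01", "description"): math.log(0.05),
    ("2013-08-01", "amount"): math.log(0.05),
    ("ACME Corp", "date"): math.log(0.05),
    ("ACME Corp", "description"): math.log(0.90),
    ("ACME Corp", "amount"): math.log(0.05),
    ("1,200.00", "date"): math.log(0.10),
    ("1,200.00", "description"): math.log(0.10),
    ("1,200.00", "amount"): math.log(0.80),
}

prob = pulp.LpProblem("table_row", pulp.LpMaximize)
x = pulp.LpVariable.dicts("assign", (tokens, labels), cat="Binary")

# Objective: maximise the joint log-probability of the chosen labelling.
prob += pulp.lpSum(log_p[(t, l)] * x[t][l] for t in tokens for l in labels)

# Structural constraints: each token gets exactly one label,
# and each label is used exactly once in the row.
for t in tokens:
    prob += pulp.lpSum(x[t][l] for l in labels) == 1
for l in labels:
    prob += pulp.lpSum(x[t][l] for t in tokens) == 1

prob.solve(pulp.PULP_CBC_CMD(msg=0))
for t in tokens:
    for l in labels:
        if x[t][l].value() == 1:
            print(t, "->", l)
```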
On the subject of scraping data from OCR'd tables:
I heard from a colleague who moved into finance that there's a mini arms race going on between some funds(?) that are subject to regulatory requirements to release financial performance metrics but for a variety of reasons would rather not (and certainly would rather not make the data machine-readable), and other hedge funds that want to run automated trading strategies off said released figures.
They keep obfuscating the tables to make them harder and harder to parse algorithmically while still remaining theoretically human-readable.
I have imminent use for this algorithm at the start-up I am interning at, but unfortunately the concepts are beyond my understanding. Could anyone please recommend resources that will help me understand and apply this method?
Um, you are in the right geographic area, but there were no strangers at my thesis defence... so unless you are my supervisor, or my external or internal examiner, it's unlikely you were there (side note: my PhD is in robotics, not this kind of stuff).
"Probabalistic Scraping of Plain Text Tables"
vs the submitted title, "Accurate, efficient and maintainable scrapers"
As someone who's scraped a lot of plain text government data, I've given up on finding a flexible machine-learning way and just hand-coded for specific usecases (because government software changes so rarely). However, the OP presents an interesting strategy (with math that I haven't quite mastered, unfortunately)