
Ask HN: How would you extract data from scans of hand-filled paper forms? - webmaven
Hello, HN!

I've stumbled across a public data set that consists (largely) of scans of
paper forms that were filled out by hand, often in idiosyncratic ways.

Besides the usual checkboxes and simple text fields, there are:

* illustrations of different options selected on some forms by being X'd, on
others by being circled, and in others by scribbling an identifier for the
option (e.g. "A")

* Supplementary data on separate pages (sometimes with handwritten tables)
that are formatted differently for every submitting organization

* Various measurements expressed as both fractions and decimals

... and other complexities, such as the forms themselves changing over time.

How would you solve turning these scans into a clean dataset?
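
Of the complications listed above, the mixed fraction and decimal measurements are the easiest to normalize once the text has been transcribed. The following is a minimal sketch, assuming the transcribed values look like "3/4", "1 1/2", "1-1/2" or "1.5"; the function name `parse_measurement` and the accepted formats are illustrative assumptions, not something taken from the data set.

```python
from fractions import Fraction
import re

# Assumed formats: "3/4", "1 1/2", "1-1/2", with an optional leading sign.
_FRACTION = re.compile(
    r"^\s*(?P<sign>[-+]?)\s*"
    r"(?:(?P<whole>\d+)[\s-]+)?"
    r"(?P<num>\d+)\s*/\s*(?P<den>\d+)\s*$"
)


def parse_measurement(text: str) -> Fraction:
    """Convert a transcribed measurement to an exact Fraction (hypothetical helper)."""
    match = _FRACTION.match(text)
    if match:
        den = int(match["den"])
        if den == 0:
            raise ValueError(f"zero denominator in {text!r}")
        value = Fraction(int(match["num"]), den)
        if match["whole"]:
            value += int(match["whole"])
        return -value if match["sign"] == "-" else value
    # Plain integers and decimals such as "2" or "1.5" are handled by Fraction
    # itself, which raises ValueError for anything it cannot parse.
    return Fraction(text.strip())
```

Keeping the result as a `Fraction` avoids rounding until the final export, where `float(parse_measurement("1 1/2"))` gives a decimal value.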
======
DanBC
I'd create an online version of the form and make it very clear and easy to
use.

I'd then mechanical turk it, with some emphasis on accuracy not speed.
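
One common way to favor accuracy over speed in crowdsourced entry is to have several workers key the same field and accept a value only when enough of them agree. This is a rough sketch of that idea, not a description of any Mechanical Turk feature; the `reconcile` helper and its `min_agreement` parameter are hypothetical.

```python
from collections import Counter


def reconcile(entries, min_agreement=2):
    """Return the agreed value for one field, or None if it needs manual review."""
    # Normalize trivially different answers ("A", "a ") before counting.
    normalized = [e.strip().lower() for e in entries if e and e.strip()]
    if not normalized:
        return None
    top = Counter(normalized).most_common(2)
    value, count = top[0]
    # A tie between the two most common answers is treated as a disagreement.
    tied = len(top) > 1 and top[1][1] == count
    if tied or count < min_agreement:
        return None
    return value
```

Fields that come back as `None` can be sent to an additional worker or reviewed by hand.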

~~~
webmaven
Hmm. How much would the data entry cost per page of the form?

~~~
brudgers
If the data has high value, then the issue is capitalization of the business
rather than cost of the process. If the data is not high value, then
committing programmer time might not make any more sense as a business
proposition than paying mechanical turks.

All with the caveat that most data is not high value. And if the goal is to
monetize the data, it is probably better to find a buyer first rather than
processing the data first. It will not only address the capitalization issue,
it will also help prioritize by value the many possible and various ways of
capturing the data in digital form.

~~~
webmaven
As with many other data sets (particularly ones that originate as some sort of
by-product), the value of the aggregate grows with broader use, rather than
exclusivity.

There would likely be so many opportunities to monetize this data indirectly,
that I don't think it makes sense to do anything _but_ make it public and
free. Let a thousand flowers bloom, and devil take the hindmost.

OTOH, any software or capabilities developed to accomplish the work would also
be quite valuable, I think, with this particular dataset serving as a decent
PoC.

------
sochix
I think your problem is too complex for OCR. The best solution is to hire
hundreds of people to do it manually for you. If you want a quick and dirty
solution by computer, you can try Ambar
([http://ambar.rdseventeen.com](http://ambar.rdseventeen.com))

~~~
webmaven
_> I think your problem is too complex for OCR._

The problem is _definitely_ too complex for OCR, particularly anything that
can't handle hand-written text.

I need a solution that, at a minimum, can look at a scan of the form,
determine which pixels are the form and which are the scribbles filling out
the form (and for which fields), and turn the scribbles into data.
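
The first part of that (separating form pixels from scribbles, per field) can be sketched as a comparison against a blank copy of the form. This is only an illustration under strong assumptions: a blank template exists for each form revision, the scan is already aligned to it, and images are grayscale grids with 0 as black and 255 as white. Plain nested lists stand in for image arrays, and `ink_mask`, `field_is_marked`, and the thresholds are hypothetical.

```python
def ink_mask(blank, filled, threshold=60):
    """Mark pixels that are noticeably darker in the filled scan than in the blank form."""
    return [
        [(b - f) > threshold for b, f in zip(blank_row, filled_row)]
        for blank_row, filled_row in zip(blank, filled)
    ]


def field_is_marked(mask, box, min_fraction=0.05):
    """Report whether enough ink falls inside a field's (top, left, bottom, right) box."""
    top, left, bottom, right = box
    cells = [mask[r][c] for r in range(top, bottom) for c in range(left, right)]
    if not cells:
        return False
    return sum(cells) / len(cells) >= min_fraction


# Tiny example: row 0 is a printed line, and two pixels of ink sit in the lower right.
blank = [[0, 0, 0, 0]] + [[255] * 4 for _ in range(3)]
filled = [row[:] for row in blank]
filled[2][2] = filled[2][3] = 30
mask = ink_mask(blank, filled)
```

This only answers "was something written in this field"; turning the isolated ink into text or a selected option still needs handwriting recognition or a human.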

Ambar looks interesting, but doesn't look like it can handle hand-written text
at all (the examples look like they turn the signatures on documents into
line-noise, even ones that are very legible).

