
Ask HN: Trying to Extract Roll Call Votes of Cambridge City Council - theszak
Any ideas, hints, tips, pointers?...

Trying to extract Roll Call Votes of Cambridge City Council. The challenge is that Roll Call Votes are embedded in Council Documents, for example

http://www2.cambridgema.gov/CityOfCambridge_Content/documents/councilor_votes/CMA_4380_20150622_20150622_letter.PDF

at

http://www2.cambridgema.gov/cityclerk/cmLetter.cfm?item_id=34162
======
dagw
Start by looking for the words YEA, NAY, ABSENT and PRESENT using simple OCR
or pattern matching. Clip the image into 4 vertical columns the same width as
the text, starting just below the text. Use a horizontal line detection
algorithm to find the boundary of each 'box'. Clip out each box and count the
number of black pixels. If the count is greater than some threshold, that box
contains a tick.

OpenCV should be able to do all these things.
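The clip-and-count step can be sketched in a few lines of pure Python (a nested list of 0/1 values stands in for a binarized image; in a real pipeline you'd load and threshold the scan with OpenCV, and the column coordinates and tick threshold below are made up for illustration):

```python
# Minimal sketch of the "count black pixels in a box" idea.
# A nested list of 0/1 values stands in for a binarized scan
# (1 = black pixel); real code would use cv2.imread + cv2.threshold.

def clip_box(img, top, bottom, left, right):
    """Return the sub-rectangle img[top:bottom, left:right]."""
    return [row[left:right] for row in img[top:bottom]]

def is_ticked(box, threshold=4):
    """A box 'contains a tick' if it has more than `threshold` black pixels."""
    return sum(sum(row) for row in box) > threshold

# Tiny fake scan: 6x8, with a scribble in the left half only.
page = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]

yea_box = clip_box(page, 0, 6, 0, 4)   # left column: has marks
nay_box = clip_box(page, 0, 6, 4, 8)   # right column: empty
print(is_ticked(yea_box), is_ticked(nay_box))  # True False
```

The threshold guards against stray specks of scanner noise being read as a vote.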

~~~
dagw
Actually forget OCR. Just find the y and x coordinates of each horizontal and
vertical line, and then clip out each box in the grid using those coordinates.
Count black pixels and done.

I hacked together something in python (because I had laundry to do :) ) and it
seems to work pretty well.
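The grid idea, without OCR, might look something like this (again a nested 0/1 list stands in for the binarized scan, and the 80%-black fill ratio for detecting ruling lines is an assumed parameter, not taken from the gist):

```python
# Sketch of the grid approach: rows and columns that are (almost)
# solid black are the table's ruling lines; clip each cell between
# consecutive lines and count the black pixels inside.

def line_coords(img, axis, fill=0.8):
    """Indices of rows (axis=0) or columns (axis=1) that are >= `fill` black."""
    h, w = len(img), len(img[0])
    if axis == 0:
        return [y for y in range(h) if sum(img[y]) >= fill * w]
    return [x for x in range(w) if sum(row[x] for row in img) >= fill * h]

def cell_counts(img):
    """Black-pixel count for the interior of every cell in the grid."""
    ys = line_coords(img, 0)
    xs = line_coords(img, 1)
    counts = []
    for y0, y1 in zip(ys, ys[1:]):
        row = []
        for x0, x1 in zip(xs, xs[1:]):
            interior = [r[x0 + 1:x1] for r in img[y0 + 1:y1]]
            row.append(sum(sum(r) for r in interior))
        counts.append(row)
    return counts

# 7x7 fake scan: a 2x2 grid with a mark in the top-left cell only.
grid = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 1, 0, 1, 0, 0, 1],
    [1, 0, 1, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1],
]
print(cell_counts(grid))  # [[2, 0], [0, 0]]
```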

~~~
dagw
Quick hack, badly tested, but seems to work:
[https://gist.github.com/dwastberg/9faea8a4ceb6dc05b52e](https://gist.github.com/dwastberg/9faea8a4ceb6dc05b52e)

~~~
dagw
Of course this code breaks if the scan isn't straight.

~~~
semiquaver
You're already using ImageMagick's `convert` in that script - there's also
`textcleaner -u`[1], which among other things automatically straightens
slightly crooked text in scans (less than 5 degrees).

[1]
[http://www.fmwconcepts.com/imagemagick/textcleaner/](http://www.fmwconcepts.com/imagemagick/textcleaner/)

------
Someone1234
Ouch. There are at least two ways a PDF can be created: with embedded text,
or with an embedded picture of text. This one is in the latter group.

Effectively what you have here is a bitmap inside a PDF which happens to
contain a scan of text. So to even begin extracting it, you'll have to pull
out the bitmap and then OCR it, and while OCRing you'll somehow have to keep
track of where the different blocks are located...

You'd need to look at several of these to see how consistent they are. If
they're laid on a flatbed scanner manually, they won't be very consistent.
However, if they're scanned via a feeder, the output should be extremely
similar each time, and you could hard-code the coordinates of the data you
want (which is extremely fragile, but is the least amount of work).

Then you just OCR the names, while checking the other boxes for any content
at all.
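Under that assumption (feeder-scanned pages that all line up), the hard-coded-coordinates approach could be sketched like this; the names, pixel ranges and the tiny fake page are all invented for illustration:

```python
# Sketch of the "hard-code the coordinates" approach: every page lines
# up, so the box for each (councillor, vote) pair sits at fixed pixel
# coordinates. A nested 0/1 list stands in for a binarized scan.

ROWS = {"Councillor A": (0, 2), "Councillor B": (2, 4)}   # y-ranges per name
COLS = {"YEA": (0, 3), "NAY": (3, 6)}                     # x-ranges per vote

def box_has_mark(page, y_range, x_range, threshold=1):
    """True if the box holds at least `threshold` black pixels."""
    y0, y1 = y_range
    x0, x1 = x_range
    black = sum(sum(row[x0:x1]) for row in page[y0:y1])
    return black >= threshold

def read_votes(page):
    """Map each councillor to the single marked vote column, or None."""
    votes = {}
    for name, ys in ROWS.items():
        marked = [v for v, xs in COLS.items() if box_has_mark(page, ys, xs)]
        votes[name] = marked[0] if len(marked) == 1 else None
    return votes

page = [
    [0, 1, 0, 0, 0, 0],   # Councillor A: mark in the YEA column
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],   # Councillor B: mark in the NAY column
    [0, 0, 0, 1, 1, 0],
]
print(read_votes(page))  # {'Councillor A': 'YEA', 'Councillor B': 'NAY'}
```

Returning None for zero or multiple marked boxes is one way to flag pages that need a human look instead of silently guessing.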

------
e1ven
This is a fixed-size problem - there are only so many documents, even if
there are a lot of them.

I'd suggest you set up a job on Mechanical Turk or similar, and pay a small
amount per page to have them re-entered in a format you can more easily read.

------
opless
Hi from the other Cambridge!

Looks like there's a lot of manual work there, unfortunately.

Most of the gov.uk info is the same: scans, PDFs, poorly formatted Word
documents and Excel spreadsheets. :(

