
Extracting tabular data from U.S. Senators' scanned-in personal finance reports - danso
https://github.com/dannguyen/abbyy-finereader-ocr-senate
======
cyanbane
Very neat project. Tesseract seems to be a lot of projects' default go-to now
for OCR. I had not heard of FineReader until now.

My question is why is it still acceptable for them to submit via paper? Who
determines the submission requirements? I am assuming it is the Houses
themselves - I mean what body within those houses determines the requirements?

edit: Seems the Secretary of the Senate keeps up with and enforces them - I
wonder if that office also makes them, or if it quite literally takes "an act
of Congress" to alter the requirements.
[https://efdsearch.senate.gov/search/home/](https://efdsearch.senate.gov/search/home/)

~~~
mdaniel
Tesseract is popular because it is open source, in my opinion. I've had a
great deal of frustration using it, but to be honest I think OCR falls into
the same valley as trying to implement an office suite, or reimplement the
Windows API - high expectations and only so many spare cycles to work on it.

I also have FineReader for Mac but I've had contact with some of Abbyy's more
expensive stuff and it's really incredible. I wouldn't recommend any other
system, if asked.

------
chatman
Using non-free software, i.e. ABBYY FineReader, is a privacy mistake if this
is to be used on personal data. On publicly available data, using commercial
non-free software isn't even close to being innovative; in other words, I
don't see any added value in using this program as opposed to using the batch
scanning features of the software directly. What is the point of this project?

~~~
danso
Hmmm...not sure where the difference in understanding is here. You're asking
if this could be a privacy violation? Is it not clear that the U.S. Congress
is required to post these forms on us.gov websites, making them accessible to
all? Are you unaware of the difference between a digital image and digital
text, that you don't understand how one is profoundly different than the
other? Help me out here.

