Does anyone have a recommendation for an OCR solution that can take in bank account documents and extract the data from them reliably? Ideally as SaaS.
Tesseract OCR (backed by Google) is not accurate enough for my needs.
I looked into outsourcing it via Mechanical Turk and http://arcgate.com/services/
The best solution so far seems to be uploading the doc to Google Drive and downloading the extracted text: http://computers.tutsplus.com/tutorials/how-to-ocr-documents-for-free-in-google-drive--cms-20460
I've worked with just about all the tools out there and my conclusion is:
OCR engines by themselves don't differ much in accuracy. The vast majority of my tests involving Tesseract, ABBYY's Cloud OCR API, Microsoft's Cloud OCR third-party API, etc., have produced nearly identical results.
If you're extracting data from predictable, structured or semi-structured input images/documents, the best approach by far is to use data extraction software with zonal or relational OCR capabilities, like ABBYY FlexiCapture or Nuance OmniPage. Neither offers a cloud API, but both sell SDKs if you want to build something out yourself. They are expensive, however. I believe the majority of, say, automatic invoice recognition systems use these or something like them on the back end. ABBYY is very lenient with the duration of trial licenses. You can purchase a license to FlexiCapture's standalone product and automate around it, which I've done successfully, but it processes everything sequentially and can't multiprocess (you need the SDK for that). OmniPage is much cheaper, but Nuance is far less lenient about extending the trial license.
The technically correct explanation for why is frankly above my head, but in my experience adding zonal/relational OCR into the mix dramatically increases accuracy in almost every application. The difference between it and programmatically parsing the text or hOCR output from a given engine is night and day.
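For anyone unfamiliar with the term, here is a rough sketch of what "zonal" means, applied to hOCR output of the kind Tesseract can emit. This is only an illustration of the idea and not how FlexiCapture or OmniPage work internally. The sample hOCR, the zone names and coordinates, and the helper names (HocrWords, words_in_zone, extract_fields) are all made up for the example, and it assumes each zone covers a single line of text.

```python
import re
from html.parser import HTMLParser

# hOCR stores each word's box in the title attribute, e.g. "bbox 50 40 180 70; x_wconf 96"
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word element we are currently inside

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            self._bbox = tuple(map(int, match.groups())) if match else None

    def handle_data(self, data):
        text = data.strip()
        if self._bbox and text:
            self.words.append((text, self._bbox))
            self._bbox = None


def words_in_zone(words, zone):
    """Join the words whose box center falls inside the zone, left to right."""
    zx0, zy0, zx1, zy1 = zone
    hits = []
    for text, (x0, y0, x1, y1) in words:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        if zx0 <= cx <= zx1 and zy0 <= cy <= zy1:
            hits.append((x0, text))
    return " ".join(text for _, text in sorted(hits))


def extract_fields(hocr, zones):
    """Map each named zone to the text found inside it."""
    parser = HocrWords()
    parser.feed(hocr)
    return {name: words_in_zone(parser.words, zone) for name, zone in zones.items()}


# Made-up hOCR fragment standing in for one page of engine output.
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 1000 1400'>
 <span class='ocr_line' title='bbox 50 40 600 70'>
  <span class='ocrx_word' title='bbox 50 40 180 70; x_wconf 96'>Account</span>
  <span class='ocrx_word' title='bbox 190 40 260 70; x_wconf 95'>No:</span>
  <span class='ocrx_word' title='bbox 300 40 480 70; x_wconf 91'>12345678</span>
 </span>
 <span class='ocr_line' title='bbox 50 100 600 130'>
  <span class='ocrx_word' title='bbox 50 100 200 130; x_wconf 94'>Closing</span>
  <span class='ocrx_word' title='bbox 210 100 330 130; x_wconf 93'>balance</span>
  <span class='ocrx_word' title='bbox 400 100 520 130; x_wconf 90'>1,204.50</span>
 </span>
</div>
"""

# Hypothetical zones (x0, y0, x1, y1) in page pixels for a known document layout.
ZONES = {
    "account_number": (280, 30, 620, 80),
    "closing_balance": (380, 90, 620, 140),
}

if __name__ == "__main__":
    print(extract_fields(SAMPLE_HOCR, ZONES))
```

The point is that the field is located by where it sits on the page rather than by pattern-matching a flat text dump. As I understand it, the commercial tools go well beyond this, for example by locating zones relative to anchor text instead of using fixed coordinates.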
Unfortunately, the above tools are all tethered to Windows VMs. If you need to run on Linux, you could try to build something with OpenKM (Java) or OCRopus (multiple language bindings). For my own uses, the build/buy analysis always seemed to favor buying, even something turnkey.
I would consider a BPO solution (human workers on the back end) to be a last resort. When I ran a battery of tests with Mechanical Turk, I found the response times too long and too variable to work well in any situation where you need to get the data quickly and reliably.
Let me know if you want to talk further and I'll shoot you an email.