Ask HN: OCR Solutions

staticautomatic · on Nov 9, 2015

It really depends on how structured and diverse your input images/documents are.

I've worked with just about all the tools out there and my conclusion is:

OCR engines by themselves don't differ much in accuracy. In the vast majority of my tests, Tesseract, ABBYY's Cloud OCR API, a third-party API for Microsoft's cloud OCR, etc. all produced nearly identical results.
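
For anyone who wants to run the same kind of comparison on their own documents, here is a minimal sketch of one way to score how close two engines' outputs are. It only assumes you already have each engine's plain-text output for the same page; the engine names and sample strings below are made up for illustration and are not results from my tests.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text):
    # Collapse runs of whitespace so layout differences don't count as errors.
    return " ".join(text.split())


def pairwise_similarity(outputs):
    """Return {(name_a, name_b): ratio} for every pair of engine outputs."""
    scores = {}
    for (a, text_a), (b, text_b) in combinations(sorted(outputs.items()), 2):
        matcher = SequenceMatcher(None, normalize(text_a), normalize(text_b))
        scores[(a, b)] = matcher.ratio()  # 1.0 means identical after normalizing
    return scores


# Hypothetical outputs for one page (placeholders, not real engine results).
outputs = {
    "engine_a": "Invoice No. 1042\nTotal: $118.50",
    "engine_b": "Invoice No. 1042\nTotal:  $118.50",
    "engine_c": "Invoice No. 1O42\nTotal: $118.50",
}

scores = pairwise_similarity(outputs)
for pair, ratio in scores.items():
    print(pair, round(ratio, 3))
```

A character-level ratio like this is crude (it treats a wrong digit in a total the same as a wrong letter in a footer), so treat it as a first pass rather than a real accuracy metric.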

If you're extracting data from predictable, structured or semi-structured input images/documents, the best approach by far is to use data extraction software with zonal or relational OCR capabilities, like ABBYY FlexiCapture or Nuance OmniPage. Neither offers a cloud API, but both sell SDKs if you want to build something out yourself. They are expensive, however. I believe the majority of, say, automatic invoice recognition systems use these or something like them on the back end. ABBYY is very lenient with the duration of trial licenses. You can purchase a license to FlexiCapture's standalone product and automate around it, which I've done successfully, but it processes everything sequentially and can't multiprocess (you need the SDK for that). OmniPage is much cheaper, but Nuance is far less lenient about extending the trial license.

The technically correct explanation for why is frankly above my head, but I can attest that adding zonal/relational OCR into the mix dramatically increases accuracy in almost every application. The difference between it and programmatically parsing the text or hOCR output from a given engine is night and day.
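
To make the comparison concrete, here is a minimal sketch of the do-it-yourself side of it: run full-page OCR first, then pull fields out of the hOCR by bounding box. This is not how FlexiCapture or OmniPage work internally (I don't know that), and the zone names, coordinates, and sample hOCR are hypothetical. It also assumes word spans carry `class`, then `title` with a `bbox`, in the order Tesseract emits them.

```python
import html
import re

# Matches hOCR word spans such as:
# <span class='ocrx_word' id='word_1_1' title='bbox 40 30 130 55; x_wconf 96'>Invoice</span>
WORD_RE = re.compile(
    r"<span[^>]*class=['\"]ocrx_word['\"][^>]*"
    r"title=['\"]bbox (\d+) (\d+) (\d+) (\d+)[^'\"]*['\"][^>]*>(.*?)</span>",
    re.S,
)


def words_by_zone(hocr, zones):
    """Assign each OCR'd word to every zone that contains the word's center."""
    found = {name: [] for name in zones}
    for x0, y0, x1, y1, raw in WORD_RE.findall(hocr):
        cx = (int(x0) + int(x1)) / 2
        cy = (int(y0) + int(y1)) / 2
        # Drop any nested markup (e.g. <strong>) and decode entities.
        text = html.unescape(re.sub(r"<[^>]+>", "", raw)).strip()
        for name, (zx0, zy0, zx1, zy1) in zones.items():
            if zx0 <= cx <= zx1 and zy0 <= cy <= zy1:
                found[name].append(text)
    return {name: " ".join(words) for name, words in found.items()}


# Hypothetical hOCR fragment and zone coordinates (pixels: x0, y0, x1, y1).
sample_hocr = """
<span class='ocrx_word' id='word_1_1' title='bbox 40 30 130 55; x_wconf 96'>Invoice</span>
<span class='ocrx_word' id='word_1_2' title='bbox 140 30 200 55; x_wconf 93'>1042</span>
<span class='ocrx_word' id='word_1_3' title='bbox 40 400 110 425; x_wconf 95'>Total:</span>
<span class='ocrx_word' id='word_1_4' title='bbox 420 400 520 425; x_wconf 91'>$118.50</span>
"""
zones = {
    "invoice_number": (135, 20, 260, 60),
    "total": (400, 390, 560, 435),
}

fields = words_by_zone(sample_hocr, zones)
print(fields)  # {'invoice_number': '1042', 'total': '$118.50'}
```

Fixed pixel zones like these break as soon as a scan is shifted, skewed, or laid out slightly differently, which is a big part of why this approach tends to fall short on real-world input.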

Unfortunately, the above tools are all tethered to Windows VMs. If you need to run on Linux, you could try to build something with OpenKM (Java) or OCRopus (multiple language bindings). For my own uses, the build/buy analysis always seemed to favor buying, even something turnkey.

I would consider a BPO solution (human workers on the back-end) to be a last resort. When I ran a battery of tests with Mechanical Turk, I found the response times too variable and long to work well in any situation where you need to reliably get the data quickly.

Let me know if you want to talk further and I'll shoot you an email.

gerh12 · on Nov 10, 2015

For a commercial solution, ABBYY is the best. For a free solution, https://ocr.a9t9.com/ is almost as good. It uses Microsoft OCR and can easily scan bank statements or receipts.

staticautomatic · on Nov 10, 2015

Maybe it's just my particular use case, but I tried that Microsoft one and the output was identical to Tesseract and ABBYY Cloud OCR.