
EasyOCR for Paperless Office - extrawurst
https://medium.com/@Extrawurst/easy-ocr-for-paperless-office-8c0a3c4962f4
======
LeSaucy
I am working on something similar, which I will probably open source in the
future. It watches a directory for OCR'd PDF files. A set of rules files
(YAML config) looks for keywords to determine where to put each document and
which date to use in its name. For example, documents from my car dealership
contain the original date of purchase, but I want those filed under the
latest service date; for documents that contain future dates, such as when
investments come up, I want the oldest or middle date instead. It has a
bunch of regular expressions to capture and parse the wide variety of date
formats used in documents. Combined with Scanbot for iOS and a ScanSnap
document scanner, it works amazingly well and lets my paper shredder get lots
of use. I also found that Tesseract-based OCR gave significantly poorer
quality than ABBYY FineReader.
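
For anyone wondering what a keyword-rules-plus-date-picking setup like this could look like, here is a minimal Python sketch. It is not the tool described above, which has not been published. The rule fields (`keywords`, `folder`, `pick`), the three date patterns, and the function names are all assumptions made for illustration, and the rules are plain dicts standing in for the YAML files so the example needs only the standard library.

```python
import re
from datetime import date

# Hypothetical rules; in the tool described above these would live in YAML files.
RULES = [
    {"keywords": ["dealership", "service"], "folder": "car", "pick": "newest"},
    {"keywords": ["investment", "maturity"], "folder": "investments", "pick": "oldest"},
]

MONTHS = {name: number for number, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

# Each entry pairs a regex with a function returning (year, month, day).
# Only three formats are covered here; a real tool would need many more.
DATE_PATTERNS = [
    # ISO style: 2019-03-14
    (re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b"),
     lambda m: (int(m[1]), int(m[2]), int(m[3]))),
    # US style: 03/14/2019
    (re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b"),
     lambda m: (int(m[3]), int(m[1]), int(m[2]))),
    # Month name: March 14, 2019 or Mar 14 2019
    (re.compile(r"\b([A-Za-z]{3})[a-z]*\.? (\d{1,2}),? (\d{4})\b"),
     lambda m: (int(m[3]), MONTHS.get(m[1].lower()), int(m[2]))),
]


def extract_dates(text):
    """Return every valid date found in text, sorted oldest first."""
    found = []
    for pattern, to_ymd in DATE_PATTERNS:
        for match in pattern.finditer(text):
            year, month, day = to_ymd(match)
            if month is None:  # word before the digits was not a month name
                continue
            try:
                found.append(date(year, month, day))
            except ValueError:  # e.g. month 13 or day 32
                continue
    return sorted(found)


def pick_date(dates, strategy):
    """Choose one date from a sorted list: 'newest', 'oldest', or the middle one."""
    if not dates:
        return None
    if strategy == "newest":
        return dates[-1]
    if strategy == "oldest":
        return dates[0]
    return dates[len(dates) // 2]


def classify(text):
    """Return (folder, filename prefix) for the first matching rule, else None."""
    lowered = text.lower()
    for rule in RULES:
        if any(keyword in lowered for keyword in rule["keywords"]):
            chosen = pick_date(extract_dates(text), rule["pick"])
            return rule["folder"], chosen.isoformat() if chosen else "undated"
    return None


sample = ("Smith Auto Dealership\nPurchased 06/01/2015\n"
          "Service performed March 14, 2019\nPrevious service 2018-09-30")
print(classify(sample))  # ('car', '2019-03-14')
```

The first rule whose keyword appears wins, so rule order matters in this sketch; a real rules file would probably want priorities or stricter matching.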

~~~
extrawurst
Cool, let us know when you open source it. Regarding ABBYY: it does not seem
to be free, so it did not meet my requirement that all tools be free/open. I
also found a comparison of both:
[http://lib.psnc.pl/Content/358/PSNC_Tesseract-FineReader-rep...](http://lib.psnc.pl/Content/358/PSNC_Tesseract-FineReader-report.pdf)
Obviously it's hard to compare, and both have their strengths.

~~~
_Nat_
I appreciate your focus on open-source here since this feels like a really
sensitive security scenario.

I mean, personally, I _love_ the idea of just scanning all of my paper
documents into a massive electronic database, but I'd really need to be able
to trust the software. If something's closed-source, I'd be worried about
everything from spyware to malicious bugs that might intentionally introduce
errors (e.g., social-engineering attacks, or simple trolling) to unintentional
bugs.

Not that open-source is a magical fix to everything, but it's a vast
improvement over closed-source solutions.

------
gregod
I can recommend openpaper[1] for desktop usage. It combines (multi-page)
scanning, OCR, automatic labeling and search into a nice GUI.

If you have a network-enabled scanner that can scan straight to an FTP folder,
these two server-based projects [2][3] offer even more convenience.

[1] [https://openpaper.work/en-us/](https://openpaper.work/en-us/)
[2] [https://github.com/the-paperless-project/paperless](https://github.com/the-paperless-project/paperless)
[3] [https://www.mayan-edms.com/](https://www.mayan-edms.com/)

------
mcguire
As an aside, I've recently been playing with a CZUR
([https://www.czur.com/](https://www.czur.com/)) ET16 book scanner and I'm
reasonably impressed. It is not as nice as a high-quality book scanner, but it
has the advantage of being cheap and actually obtainable. And it produces
decent scans.

The OCR is not too bad, but one of the things I'm going to do once I pop a few
other projects off the stack is to see about using Tesseract (and possibly
custom training for it) on the generated raw TIFF files.
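
As a rough idea of what running Tesseract over a folder of raw TIFFs could look like, here is a small Python sketch that shells out to the `tesseract` command-line tool. The folder layout and the function names `tesseract_command` and `ocr_folder` are assumptions for illustration, not anything specific to the CZUR software. A custom-trained model would presumably be selected by passing its name as `lang`, provided the matching `.traineddata` file is installed.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_command(tiff_path, out_dir, lang="eng"):
    """Build the argument list for one Tesseract run that writes a searchable PDF."""
    tiff_path = Path(tiff_path)
    out_base = Path(out_dir) / tiff_path.stem  # Tesseract appends ".pdf" itself
    return ["tesseract", str(tiff_path), str(out_base), "-l", lang, "pdf"]


def ocr_folder(scan_dir, out_dir, lang="eng"):
    """Run Tesseract on every .tif/.tiff file in scan_dir (hypothetical layout)."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for tiff in sorted(Path(scan_dir).iterdir()):
        if tiff.suffix.lower() in (".tif", ".tiff"):
            # check=True raises CalledProcessError if Tesseract fails on a page
            subprocess.run(tesseract_command(tiff, out_dir, lang), check=True)
```

Calling `ocr_folder("scans", "ocr_out")` would then write one PDF per scanned page into `ocr_out`.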

------
caseyf7
Any recommendations on scanners? Is the Fujitsu ScanSnap still the best?

