EasyOCR for Paperless Office (medium.com)
42 points by extrawurst 70 days ago | 7 comments

I am working on something similar, which I will probably open source in the future. It watches a directory for OCR'd PDF files. It has a set of rules files (YAML config) which look for keywords to determine where to put the document and which dates to use in the name. (For example, documents from my car dealership carry the original date of purchase, but I want those documents to use the latest service date; for documents containing future dates on which investments come up, I want to use the oldest or middle date.) It has a bunch of regular expressions to capture and parse the wide range of date formats used in documents. Combined with Scanbot for iOS and a ScanSnap document scanner, it works amazingly well and lets my paper shredder get lots of use. I also found that Tesseract-based OCR gave significantly poorer quality than ABBYY FineReader.
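To make the rules-plus-dates idea above concrete, here is a minimal sketch. It is not the tool described in the comment (which is unpublished); the rule layout, the folder names, the `find_dates`/`pick_date`/`classify` helpers and the two date patterns are all assumptions made for illustration, and a plain Python list stands in for the YAML rules files.

```python
import re
from datetime import date

# Hypothetical rules: keywords select a folder, "date" says which date names the file.
RULES = [
    {"keywords": ["dealership", "service"], "folder": "car", "date": "newest"},
    {"keywords": ["investment", "maturity"], "folder": "investments", "date": "oldest"},
]

MONTHS = {name: number for number, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

# Only two formats are handled here; a real tool would need many more patterns.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")            # 2019-03-03
LONG_DATE = re.compile(r"\b(\d{1,2}) ([A-Za-z]{3})[a-z]* (\d{4})\b")  # 3 March 2019


def find_dates(text):
    """Return all recognizable dates in the text, sorted oldest first."""
    found = []
    for year, month, day in ISO_DATE.findall(text):
        try:
            found.append(date(int(year), int(month), int(day)))
        except ValueError:
            pass  # looks like a date but is not a valid one, e.g. 2019-13-40
    for day, month_name, year in LONG_DATE.findall(text):
        month = MONTHS.get(month_name.lower())
        if month is None:
            continue
        try:
            found.append(date(int(year), month, int(day)))
        except ValueError:
            pass
    return sorted(set(found))


def pick_date(dates, strategy):
    """Choose the newest, oldest, or middle date from a sorted list."""
    if not dates:
        return None
    if strategy == "newest":
        return dates[-1]
    if strategy == "oldest":
        return dates[0]
    return dates[len(dates) // 2]  # "middle"


def classify(text):
    """Return (folder, date) for the first rule with a matching keyword, else None."""
    lowered = text.lower()
    for rule in RULES:
        if any(keyword in lowered for keyword in rule["keywords"]):
            return rule["folder"], pick_date(find_dates(text), rule["date"])
    return None
```

For example, `classify("Dealership invoice. Purchased 2015-06-01. Last service 3 March 2019.")` selects the `car` folder and the 2019 service date rather than the purchase date.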

Cool, let us know when you open source it. Regarding ABBYY: it does not seem to be free, so it did not meet my requirement that all tools be free/open. I also found a comparison of the two: http://lib.psnc.pl/Content/358/PSNC_Tesseract-FineReader-rep... Obviously it's hard to compare, and both have their strengths.

I appreciate your focus on open-source here since this feels like a really sensitive security scenario.

I mean, personally, I love the idea of just scanning all of my paper documents into a massive electronic database, but I'd really need to be able to trust the software. If something's closed-source, I'd be worried about everything from spyware to malicious bugs that might intentionally introduce errors (e.g., social-engineering attacks, or simple trolling) to unintentional bugs.

Not that open-source is a magical fix to everything, but it's a vast improvement over closed-source solutions.

I cannot say what has changed lately (meaning the last several years), but apart from the test above (which is mainly about antique Polish documents, printed before 1850 in antiqua and gothic fonts), I needed reliable OCR software for work, reading "normal" modern documents, almost invariably printed in Arial or Times New Roman. There was simply no match: the difference between the accuracy of Tesseract (very, very low) and FineReader (very high) was very noticeable.

In any case, the OCRed documents still needed some serious editing/correcting.

I also suspect that part of the good (or bad) results might be connected to the language in which the documents are written (Italian in my case), to the "width" of the vocabulary used, and to the way the document is structured. In my case there were often tables, which for some reason were easily identified by FineReader and rarely or never by Tesseract.

In any case, I would say (without having made an actual measurement) that Tesseract was roughly at or below 60% accuracy where FineReader was around 80%, possibly a little more.

But even though it was the best one (we also tested other software, I cannot remember the names), FineReader was far from being a "set and forget" kind of tool.

I would be curious to know how you would rate the word accuracy of your solution.
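For anyone wanting to put a number on word accuracy rather than estimate it, one common approach is to compare the OCR output against a hand-corrected transcript using word-level edit distance. The sketch below assumes that definition (accuracy = 1 - edits / reference words) and naive whitespace tokenization; the function names are made up for illustration, and this is not how any of the figures quoted in this thread were obtained.

```python
def word_edit_distance(reference_words, ocr_words):
    """Levenshtein distance counted in whole words (insert, delete, substitute)."""
    previous = list(range(len(ocr_words) + 1))
    for i, ref_word in enumerate(reference_words, start=1):
        current = [i]
        for j, ocr_word in enumerate(ocr_words, start=1):
            current.append(min(
                previous[j] + 1,                           # word missing from OCR output
                current[j - 1] + 1,                        # extra word in OCR output
                previous[j - 1] + (ref_word != ocr_word),  # match or substitution
            ))
        previous = current
    return previous[-1]


def word_accuracy(reference_text, ocr_text):
    """Fraction of reference words recovered correctly, clamped to [0, 1]."""
    reference_words = reference_text.split()
    ocr_words = ocr_text.split()
    if not reference_words:
        raise ValueError("reference text must contain at least one word")
    errors = word_edit_distance(reference_words, ocr_words)
    return max(0.0, 1.0 - errors / len(reference_words))
```

With a five-word reference and one misread word, for instance `word_accuracy("the quick brown fox jumps", "the quick brovvn fox jumps")`, one of five words is wrong, so the result is 0.8.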

Anyway, and only as another data point: it was 1986 (or possibly 1987) when Xerox representatives told me that "soon" (meaning in no more than a few years) the office would become "paperless".

I can recommend openpaper[1] for desktop usage. It combines (multi-page) scanning, OCR, automatic labeling and search in a nice GUI.

If you have a network-enabled scanner that can scan straight to an FTP folder, these two server-based projects [2][3] offer even more convenience.

[1] https://openpaper.work/en-us/
[2] https://github.com/the-paperless-project/paperless
[3] https://www.mayan-edms.com/

As an aside, I've recently been playing with a CZUR (https://www.czur.com/) ET16 book scanner and I'm reasonably impressed. It is not as nice as a high quality book scanner, but it has the advantage of being cheap and actually obtainable. And it produces decent scans.

The OCR is not too bad, but one of the things I'm going to do once I pop a few other projects off the stack is to see about using tesseract (and possibly custom training for it) on the generated raw TIFF files.

Any recommendations on scanners? Is the Fujitsu ScanSnap still the best?
