Ask HN: OCR Libraries for Receipt Scanning/Parsing?
69 points by selbyk on April 3, 2021 | 38 comments
I'm interested in keeping tabs on my spending and comparing prices of items I buy at grocery stores, because I tend to not think about it when I need something. I am conscious of the extreme price discrepancies for the exact same items at stores just blocks apart here in NYC, but it's difficult to keep track of the prices of each item at various places to optimize shopping.

I want to build a system that can keep a running tab of my purchases by item, price, and store. I need to find a library that can effectively scan a receipt, recognize the store (usually name, number, address and logo at the top), and differentiate each item label and its price. I plan to manually tag each item label from a store's receipt with the item's barcode the first time it is seen.

I have been googling sporadically for the past 6 months but am still unsure which OCR library (or libraries) I should invest my time in, or how low-level I should start. Should I grab a library like Tesseract and do my own feature extraction, or use libraries that spit out semi-structured objects with text and hope they return something similar enough across store receipts to make sense of consistently?

I'm ok with this being an extended project, but I would like some input on choosing a solid library with accurate OCR and advice on how to approach training/parsing from someone with more experience.

Other solutions and advice are also welcome++



If you need 99%+ accuracy, go for Amazon Mechanical Turk. They are used by Wave Accounting and other office application companies for receipt OCR. For 85-95%+ accuracy, any off-the-shelf solution like the Google Cloud ML APIs or AWS Textract will be fine. You can get better results with both the cloud APIs and hand-rolled ML models if you have a good dataset. For this sort of application, a large quantity of well-annotated data is king. If you only have <100 receipts per year and need very high accuracy, it might be cheaper to just go with Mechanical Turk end-to-end. You have to pay people to annotate the data anyway if you want to train a model, so it might be easier to just stick with humans.


Since/until when was Wave using Mechanical Turk? That sounds like a reasonable thing to do, but they weren't using it when I worked there last year.


This Wave user also wants to know.


I think I mixed it up with Expensify. I use several bookkeeping tools.


Maybe this is helpful: https://nanonets.com/blog/receipt-ocr/

In my opinion, Tesseract is the most sophisticated "free" OCR solution out there. The problem with Tesseract is not its recognition capabilities but rather the preprocessing steps it needs:

  - thresholding
  - deskewing
  - segmentation
  - ...
There is a non-free C# library that improves recognition A LOT just by providing these abilities: https://www.vintasoft.com/vsocr-dotnet-index.html

If you find a good Open Source solution, I would be interested, too...
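
To make the first preprocessing step concrete, here is a minimal sketch of global thresholding in plain Python. It uses Otsu's method, which is chosen here only for illustration; nothing above says Tesseract or the VintaSoft library works this way. The function names and the idea of passing pixels as plain lists are assumptions for the example, and a real pipeline would more likely use an image library.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that best separates dark from light pixels.

    `pixels` is a flat iterable of integer grey values in 0..255.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_level, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += hist[level]
        sum_bg += level * hist[level]
        if weight_bg == 0:
            continue  # no pixels at or below this level yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already in the background class
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance (up to a constant factor); Otsu maximizes this.
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_level = variance, level
    return best_level


def binarize(rows, threshold):
    """Map each pixel to 0 (ink) or 255 (paper); values <= threshold count as ink."""
    return [[0 if p <= threshold else 255 for p in row] for row in rows]
```

Usage would be something like `binarize(image_rows, otsu_threshold(p for row in image_rows for p in row))`. A single global threshold tends to struggle with the uneven lighting of phone photos, so treat this as a starting point only.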



Hahaha, oh wow, somebody found my Bachelor's thesis project about this! Unfortunately, it suffered significant bit rot. I tried last year to run it again, but I couldn't get it to work anymore :(


For "middle ground" projects like this (criteria: a common enough problem that lots of people _should_ have thought about it -- but it may not be a lucrative core business area -- and there aren't any household-name open source projects that cover it), I often turn to GitHub repository search to see what's available.

Based on that, your best bet might be https://github.com/ReceiptManager/receipt-parser-legacy, which is a Python library built on top of the Tesseract OCR engine. You can use it containerized, in Android/iOS applications, or via your own Python scripts.


NB: it's also previously been discussed on HN: https://news.ycombinator.com/item?id=10338199


I worked on such a project 8 years ago. I actually ended up building my own OCR engine, after manually annotating about 50 receipts (about 8000 characters if I remember correctly). One of the problems I encountered back then is that snapping a picture of a receipt with your phone results in weird lighting conditions and angles, which mess with the OCR engine. The second problem is that it's hard to keep the receipt straight while taking the picture, so it is hard to identify lines in the picture because they are curved.

To some extent, all this is solved by some modern APIs, such as what GCP or AWS offer, for doing OCR for you. But as far as I know, there is still one more challenge: interpreting the text. Inferring what each line is, what's the price for which item (some receipts have the price on the same line, some on the next line, some above) is quite hard. I tried to do it with rules (regexes and lots of ifs), but even a 95% accuracy of the OCR engine will trip you up.
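
A minimal sketch of what such a rule-based parser can look like, assuming the OCR output has already been split into text lines. The regexes, the `parse_items` helper and the receipt layout it expects are all hypothetical, and it only handles a price on the same line or on the following line.

```python
import re
from decimal import Decimal

# A price at the end of a line, e.g. "3.49" or "$12,99" (assumes two decimal digits).
PRICE_AT_END = re.compile(r"\$?\s*(\d+[.,]\d{2})\s*$")
# A line that contains nothing but a price.
PRICE_ONLY = re.compile(r"^\s*\$?\s*(\d+[.,]\d{2})\s*$")


def to_decimal(text):
    """Normalize a decimal comma to a decimal point and parse the amount."""
    return Decimal(text.replace(",", "."))


def parse_items(lines):
    """Pair item labels with prices found on the same line or the next line."""
    items = []
    pending_label = None  # a label seen without a price, waiting for the next line
    for line in lines:
        line = line.strip()
        if not line:
            continue
        only = PRICE_ONLY.match(line)
        if only:
            # A bare price belongs to the label on the previous line, if any.
            if pending_label is not None:
                items.append((pending_label, to_decimal(only.group(1))))
                pending_label = None
            continue
        end = PRICE_AT_END.search(line)
        if end:
            label = line[:end.start()].strip()
            items.append((label, to_decimal(end.group(1))))
            pending_label = None
        else:
            pending_label = line
    return items
```

A single misread character (say "1.29" read as "1.Z9") makes a line fall through to the "label without price" branch, which is exactly how imperfect OCR trips up rules like these.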

You can probably frame this as an ML problem as well, but I don't think you'll find any datasets for this.


As I mentioned in another comment, some friends of mine worked at a startup that's entirely built around receipt scanning and itemization, and your comment aligns with what I heard from them ad nauseam: receipts are hard, in large part because there's just no standard way of putting things onto them.

How do you show subtotals, taxes, and totals? How do you flag that an item is taxable or not? What's part of the header? What format are the numbers? What kind of subtle background text is on the receipt? Is the receipt at an angle? Is the picture taken of just a receipt, or a receipt held up in the air, with stuff behind it?

Sometimes, there are lines on receipts that are just meant to be ignored, maybe for old tax regulations that don't exist anymore but were important when the receipt-printing software was written.

It's a mess.


One thing that could help a lot is trying to get the receipt data from the loyalty card programs some shops have.

In Romania, almost all big stores have mobile apps which allow you to export your purchase lists. Granted, some of them have dumb outputs (Lidl gives you an image of your receipt, so you still have to do OCR on it; Carrefour gives you a PDF), but it does make the problem much easier. Of course, this won't work for your random corner shop.


As others have suggested, this is not a project where stitching together OSS OCR bits is going to yield anything near useful results. At multiple levels of the stack, the error bars on the tech are really wide, and narrowing them is still a research project. This is why most of the suggestions say that if you want a workable solution, brute-force cheap human labor via Mechanical Turk is the only option.

However, if you are looking for a project, picking one grocery store with one receipt format and generally limited/consistent product coding schemes is a reasonable thing to plug away on. Speaking personally I did this with Whole Foods receipts for a while and was able to get to almost, kinda usable. But then the pandemic hit and I started ordering delivery which obviates the whole receipt ingestion thing because I can get all those details directly from Amazon (modulo doing some data scraping).

Analytics on food purchases are a tremendously interesting and deeply underexplored space in which there is lots of future commercial potential.


I've had friends work at Sensibill[1] which sells tools (mostly to banks) to build some of what you're imagining having right into banking+expense tracking apps. Not sure if they have anything à la carte but they might have something of value to look at.

1 - https://getsensibill.com/


If you're set on building your own, you're probably not interested in using this: https://blog.google/technology/area-120/stack , but it might be a useful reference.


If anyone is interested in keeping long-term records and functionality, I'd suggest steering away from anything Google, with their track record of killing things.


Always a possibility. I have no concern using this because: "Stack can also automatically save a copy of your documents to Google Drive. That way, should you ever decide to stop using Stack, your documents will be accessible in Drive and easy to export."


Unless Google closes your account for uploading too many, too few or just wrong documents - or, more likely because of some unspecified violation of something.


US-only for no apparent reason.


I've been experimenting with using Tesseract to get information out of scanned tutorial roll sheets, with surprising success. If you ask it for TSV or hOCR output, it will give you a bounding box for each word. To extract a student's attendance information, I grep the TSV files for a student ID number or name, get the y position with sed, and combine slices of the page images with ImageMagick (in my case I want to see all the handwritten ticks and numbers). You might be able to do something similar by looking for numbers on the same line as keywords like "Total" or "apples" or whatever. Some of your success will depend on how well you scan the receipts.
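
The same idea can be sketched in Python instead of grep and sed. This assumes the TSV has Tesseract's usual header row (`level`, `page_num`, `block_num`, `par_num`, `line_num`, `word_num`, `left`, `top`, `width`, `height`, `conf`, `text`); the sample rows are made up and the helper names are hypothetical.

```python
import csv
import io
import re

# Made-up sample in the shape of Tesseract's TSV output.
SAMPLE_TSV = "\n".join("\t".join(row) for row in [
    ("level", "page_num", "block_num", "par_num", "line_num", "word_num",
     "left", "top", "width", "height", "conf", "text"),
    ("4", "1", "1", "1", "1", "0", "10", "100", "300", "20", "-1", ""),
    ("5", "1", "1", "1", "1", "1", "10", "100", "80", "20", "96", "Apples"),
    ("5", "1", "1", "1", "1", "2", "250", "100", "60", "20", "91", "2.49"),
    ("5", "1", "1", "1", "2", "1", "10", "130", "70", "20", "95", "Total:"),
    ("5", "1", "1", "1", "2", "2", "250", "130", "60", "20", "93", "7.98"),
])


def read_words(tsv_text):
    """Yield one dict per recognized word, skipping rows that carry no text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    for row in reader:
        text = (row.get("text") or "").strip()
        if not text:
            continue  # page, block, paragraph and line rows have empty text
        yield {
            "line": (row["page_num"], row["block_num"], row["par_num"], row["line_num"]),
            "left": int(row["left"]),
            "text": text,
        }


def amount_on_line_with(tsv_text, keyword):
    """Return the first price-like word to the right of `keyword` on its line."""
    words = list(read_words(tsv_text))
    for word in words:
        if word["text"].lower().rstrip(":") != keyword.lower():
            continue
        to_the_right = [w for w in words
                        if w["line"] == word["line"] and w["left"] > word["left"]]
        for candidate in sorted(to_the_right, key=lambda w: w["left"]):
            if re.fullmatch(r"\$?\d+[.,]\d{2}", candidate["text"]):
                return candidate["text"]
    return None
```

Calling `amount_on_line_with(SAMPLE_TSV, "total")` picks out the amount printed on the "Total:" line. It relies on Tesseract having put the label and the amount in the same detected line, which may not hold for skewed or crumpled receipts.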


Makes me think of this idea I had: the receipt printer developers should just add a feature to allow printing of a QR code that contains all relevant information on the receipt in CSV format. Customers could choose whether they want one and be charged a small fee or if they're fine without.
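
As a rough sketch of what such a payload could look like, here is a CSV serializer plus a size check. The column names and helper functions are hypothetical, not an existing standard; the only outside fact used is that the largest QR code (version 40, error correction level L) holds 2,953 bytes in byte mode.

```python
import csv
import io

# Byte-mode capacity of the largest QR code (version 40, error correction level L).
QR_MAX_BYTES = 2953


def receipt_to_csv(store, date, items):
    """Serialize a receipt to CSV; `items` holds (barcode, label, quantity, unit_price)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Hypothetical column layout, chosen only for this example.
    writer.writerow(["store", "date", "barcode", "label", "quantity", "unit_price"])
    for barcode, label, quantity, unit_price in items:
        writer.writerow([store, date, barcode, label, quantity, unit_price])
    return buffer.getvalue()


def fits_in_one_qr_code(payload):
    """Check whether the UTF-8 encoded payload fits in a single QR code."""
    return len(payload.encode("utf-8")) <= QR_MAX_BYTES
```

At roughly 50 bytes per row in this layout, a single code would hold a few dozen items; a long grocery receipt would need a more compact format or more than one code.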

Unless there's some pressure through government regulation to implement this, it won't happen though ... because who's least interested in customers comparing prices and having transparency in their spending? The retailers, obviously.


I was the tech founder at a company that built this exact technology. checkout51.com (still running but we sold it and I've since moved on)

If you want to chat, feel free to reach out; I could talk all day about this stuff.


I had this idea a while ago, tried a number of libraries including Tesseract, and found all the results extremely poor. I'd be interested to see if one that works is suggested.


In my experience, preprocessing the image is extremely important before feeding it into Tesseract. I tried to do the same as OP on a rainy afternoon but shelved the project after it became more of an ImageMagick research task than about creating a database of my receipts. I got Tesseract to recognize about 80% of the text, but it was still missing some letters from slightly worn-out receipts.


I found EasyOCR to be the most accurate. Tesseract was meh.


Microsoft's Form Recognizer is pretty good. (https://docs.microsoft.com/en-us/azure/cognitive-services/fo...)

discl@imer - I verk 4 not-Macro-Hard. But, I have no connection to this team.

edit: this might be terribly extra for personal use.


For free OCR and quick prototyping, I use https://ocr.space/receiptscanning. It is easy to use and has a generous free tier of 25,000 scans each month.

Having said that, I am sure there must be some existing accounting software with built-in OCR? Probably even an app?


I built an app to scan receipts for bill splitting, although your use case is certainly interesting.

Google's ML Kit is very accurate for on-device recognition. You can even feed it frames straight from the camera and get almost real-time results. Your bigger problem will be parsing the results and handling very inconsistent receipts.
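
One small piece of that parsing problem is turning loose words back into receipt rows. Here is a minimal sketch, assuming the OCR step yields words with bounding boxes; the tuple layout, the `tolerance` default and the function name are assumptions for the example, not any OCR library's API.

```python
def group_into_rows(words, tolerance=0.6):
    """Group OCR words into receipt rows by vertical position.

    `words` is a list of (text, left, top, height) tuples in pixel units.
    A word joins the current row when its vertical centre lies within
    `tolerance` * row height of the centre of the row's first word.
    """
    rows = []
    for word in sorted(words, key=lambda w: w[2] + w[3] / 2):
        _text, _left, top, height = word
        centre = top + height / 2
        if rows and abs(centre - rows[-1]["centre"]) <= tolerance * rows[-1]["height"]:
            rows[-1]["words"].append(word)
        else:
            rows.append({"centre": centre, "height": height, "words": [word]})
    # Within each row, read the words left to right.
    return [" ".join(w[0] for w in sorted(row["words"], key=lambda w: w[1]))
            for row in rows]
```

For example, a label at the left margin and a price at the right margin with nearly the same `top` end up in one row string, which a line-based parser can then consume. A tilted receipt breaks the fixed-centre assumption, so deskewing first would likely help.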


If you have the time, go for ML Kit or any other OCR API (Tesseract is pathetic for non-scanned, in-the-wild images), then put your parsers on top of the OCR output.

If time is of the essence simply use AWS Textract & be done with its free tier.


Well, you should evaluate ABBYY to see how well it performs, as it is one of the most widely used commercial applications for OCR.

I used it for years to scan our bank statements (before our bank could export data).

It was the only thing I ever found that handled tabular data properly.


I'm familiar with Camelot; it is used by a UI called Excalibur. It is intended more for scanning invoices or bank statements. It is perfect for tabular data and can handle tables without explicit column edges.


Google launched an Android app called Stack. It's out of their Area 120, so it's not a fully supported product. But it scans and uploads to Google Drive and does some OCR. It's been pleasant to use.


We have a similar project and tried AWS Textract and the Google Cloud Vision API. For us, Google's OCR seems to give more accurate results. Pricing is nearly the same.


I recently started using paperless-ng. Check it out; perhaps you can build on that. It includes Tesseract for OCR, for example.


Use Textract. It's super easy to integrate and the results are pretty impressive. Also, it is cheap.


Why not just use Mechanical Turk? You can get receipts done for pennies.


because that is not an OCR Library for Receipt Scanning/Parsing


It's a service so it requires sending your data over the network, but the rest is just ignorable implementation details.



