Hacker News
Writing a fuzzy receipt parser in Python (trivago.com)
133 points by andygrunwald on Oct 6, 2015 | 40 comments

It's a shame that receipts don't have machine readable output.

QR codes can hold a little over 1,200 characters, which should be more than enough for most receipts.


Edit: related link: https://www.quora.com/Can-and-how-cash-register-receipts-be-...

I'd be happy if they made the store name, date, and total more prominent. It's amazing how often that's hard to find, despite being the most important information to the customer.

Exactly. Even the term for the total changes with every receipt. I discovered all the possible terms you can imagine, like "sum", "total amount", "overall", ... (in German, obviously).

I think it'd be awesome for some commerce/business organization to champion electronic data on paper receipts.

But I think it's much more likely that we'll sooner see itemized electronic receipt data from the smartphone-based payment brokers (Apple Pay, Android Pay, etc.) -- assuming they don't already do that.

Partly, because they're designed that way.

Stores that are mandated to issue receipts are giving away trade secrets. If the format was standardized and easily readable then it'd be much easier for competitors to analyze their data.

QR codes can hold even more than that. Version 40 stores up to 2,953 bytes (or 4,296 alphanumeric characters) at the lowest error-correction level.

You could do some pretty amazing things in QR codes by base64 encoding compressed textual data.
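A minimal sketch of that idea, using only the standard library (the receipt text here is made up for illustration):

```python
import base64
import zlib

# Hypothetical receipt body; repetitive plain text compresses very well.
receipt = "\n".join("Item %02d;1.99 EUR" % i for i in range(40))

# Compress at the highest level, then base64-encode for a QR-safe payload.
payload = base64.b64encode(zlib.compress(receipt.encode("utf-8"), 9)).decode("ascii")

# Round trip to prove nothing is lost.
restored = zlib.decompress(base64.b64decode(payload)).decode("utf-8")

print(len(receipt), len(payload), restored == receipt)
```

Base64 adds roughly 33% overhead, but on repetitive receipt text the compression usually more than pays for it.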

Yeah. Some receipts also have a barcode on them, but it's not guaranteed that it contains all the required information.

Or that it's not encrypted. Most of the time it's probably just a record pointing into a database: useless to everyone except the creator, who has the context and database access.

Hey, this is pretty cool. I actually tried something similar (keeping a list of shop names and matching it against Tesseract's results). I was trying a Hough transform for slight image rotations. I wasn't aware of ImageMagick's textcleaner script; that could have saved me a lot of trouble :) I got roadblocked by the problem of having various kinds of receipts with absolutely no layout in common. I figured the system would need a lot of training to reach decent accuracy, and left it for another day.

Yeah, relying on the layout never worked for me either. For instance, even the supermarket address isn't standardized. Some shops use:

> Market name > Examplestreet 12 > 19393 Examplecity

while others use:

> Market name > 19393 Examplecity > Examplestreet 12

We're not even talking about the invoice/receipt layout. It's different all the time.

Cool. We did a similar thing at a hackathon, integrated with Dropbox and automatic monthly receipt generation. I think most, if not all, of the code should still be in pieces on our GitHub accounts.

Argh, I should not comment from the phone. We had similar problems using the layout. In the end we mainly focused on getting the date and the total correct by checking parsable pieces of text.

Hey, author here. Happy to answer any questions and hear any kind of feedback.

Cool project! Something you might want to keep in mind for future iterations or similar projects -

You ended up with 127 usable scans, and then used all 127 scans as your test data set while developing. You run the risk, here, of 'overfitting' your data set. That is to say, you figure out a set of parameters and techniques that gets a good match rate (98%) on your test data set, but fails when you try it on new data.

There's a useful technique in machine learning and statistics called cross-validation. There are different specific techniques, but basically you should split your dataset into training (~80%) and validation (~20%) sets, randomly sampled. You develop with just the training set, and when you have a good match rate, you test whether it is also good on your validation set. This helps you detect whether your technique actually generalizes to new input, or whether it just happens to match what you trained on.
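A minimal sketch of such a split for the 127 scans, using only the standard library (the filenames are made up; any list of samples works):

```python
import random

# Stand-ins for the 127 usable scans from the article.
scans = ["receipt_%03d.png" % i for i in range(127)]

random.seed(42)      # fixed seed so the split is reproducible
random.shuffle(scans)

split = int(len(scans) * 0.8)
train, validation = scans[:split], scans[split:]

# Tune parameters on `train` only; report the match rate on `validation`.
print(len(train), len(validation))  # 101 26
```

The key discipline is to never look at the validation set while tweaking parameters; otherwise it silently becomes part of the training data.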

Thanks for the hint, that's an important note. I tried to be as generic as possible, but I still need to test it with more scans.

This is really cool, nice work! I tinkered with the same thing a while ago. Sadly, that project has been neglected. One thing I was most proud of was developing a way to automatically rotate the receipt image to the most ideal position to better help the OCR. I would notice decent improvements even with a few degrees difference.

Question: have you thought about any better methods for getting the receipts scanned? A big hurdle for me was the time it took to scan everything. I know there are scanners designed for such a thing, and you seem to have used one, but were you satisfied with it?

And thanks for the get_close_matches() hint! I'll try and remember that if I ever start hacking again on this.

Yeah, scanning the receipts takes the longest time in the whole process, so you should make sure you have a fast and simple scanner. I can recommend buying a simple photo scanner like I did. It's just a couple of bucks and it will save you so much time. The quality was near perfect; even very old receipts could be parsed after scanning with it.

When it comes to rotation, you can try http://www.fmwconcepts.com/imagemagick/textcleaner/. It will rotate your scan if it's not perfectly aligned.

Ha, cool, didn't know about textcleaner either. Always fun when you write some huge routine to do something that you later find out is a simple command elsewhere. Thanks!

Just a quick FYI. I looked at "Fred's Scripts" a while back for a commercial project I was working on. Have a look at the license, if you ever plan on any commercial work using those scripts.

get_close_matches is such a godsend. I've already used it this morning to deal with matching up 2 lists with about 100 things in each.

One trick I've found is to use it in both directions and then only keep the results that agree both ways.

Something like this (very terse version):

    import difflib

    closest_or_none = lambda l, options: (difflib.get_close_matches(l, options, n=1) or [None])[0]
    dir1 = [(l, closest_or_none(l, list2)) for l in list1]
    dir2 = [(closest_or_none(l, list1), l) for l in list2]
    one_to_one = set(dir1).intersection(set(dir2))

Interested to hear how you approached the rotation thing.

Rough summary: I generated best-fit lines from the alignment of the letters. You know the lines on a receipt consist of straight-aligned letters, so you can fit lines based on that. I then took the average slope of all those lines and based the rotation calculation on that average.
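This isn't the commenter's actual code, but the idea can be sketched in plain Python: fit a least-squares line through the letter centers of each text line (e.g. from OCR bounding boxes), average the slopes, and convert to a rotation angle.

```python
import math

def fit_slope(points):
    """Least-squares slope of a line through (x, y) points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    num = sum((x - mx) * (y - my) for x, y in points)
    den = sum((x - mx) ** 2 for x, _ in points)
    return num / den

def estimate_skew(lines):
    """Average the slopes of all text lines and convert to degrees."""
    slopes = [fit_slope(pts) for pts in lines]
    return math.degrees(math.atan(sum(slopes) / len(slopes)))

# Fake receipt tilted by ~2 degrees: every baseline shares the same slope.
tilt = math.tan(math.radians(2.0))
fake = [[(x, 10 * row + tilt * x) for x in range(0, 200, 20)] for row in range(5)]
print(round(estimate_skew(fake), 1))  # 2.0
```

The negative of the estimated angle is then what you'd feed to the image-rotation step before OCR.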

Have you guys thought about scaling this? I'm in the process of doing preprocessing with OpenCV and then OCR with Tesseract, both in Python. Since OCR involves a lot of I/O, I'm not sure how it will scale. Just wondering if you've put any thought into that.

Nope, haven't thought about it yet. Actually, it will be running on my Raspberry Pi in the end. But if I had to scale it, I would probably try putting Tesseract into a Docker container and starting N instances with Docker Swarm. The output would go to a shared local volume or even S3. That should be enough to scale to a very large cluster.

BTW, you might want to have a look at https://github.com/tleyden/open-ocr

I'm surprised that the detection rate was so low for the prices/taxes/totals since there's such a limited character set. Was the number placement so different from receipt to receipt that you couldn't train the parser to expect numbers in a certain area?

You're right, the format is somewhat standardized. First there's a list of items with prices and then comes the total. So far so good.

But depending on the length of the receipt, the total is always in a different place. Another problem was that Tesseract's detection rate for whole words was pretty low to begin with, so searching for exact words like "total" or "sum" didn't work. Even if you could give the parser a rough area to look for the total, you would still need to do some fuzzy matching on "total" and "sum" to get it right.
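A minimal sketch of that fuzzy label matching with difflib (the OCR output and label list here are made up; "Gesamtbetraq" stands in for a typical Tesseract misread of a German total label):

```python
import difflib

# Hypothetical words Tesseract might emit, including one OCR misread.
ocr_words = ["Brot", "2,49", "Gesamtbetraq", "12,99", "EUR"]
labels = ["summe", "gesamt", "gesamtbetrag", "total"]

hits = {}
for word in ocr_words:
    # cutoff=0.8 keeps near-misses like "Gesamtbetraq" but rejects noise.
    match = difflib.get_close_matches(word.lower(), labels, n=1, cutoff=0.8)
    if match:
        hits[word] = match[0]

print(hits)  # {'Gesamtbetraq': 'gesamtbetrag'}
```

The cutoff is the main knob: too low and item names start matching label words, too high and real OCR misreads slip through unmatched.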

In the end it's most important that it works for as many cases as possible. Getting to 60% was trivial but after that it got interesting. ;-)

If you are only interested in the sum, couldn't you ask the user to take a close-up shot of just the sum instead of the whole receipt? Worse UX, but that should be easier to parse.

The cool thing for me was getting it to work without any user interaction: scan and forget. Sure, you could take a close-up, but then you could also just type it into a form field, right? It works surprisingly well without many tweaks. That's what I like about it.

If you're doing that, why not simply display the image and have the user pick the sum for you? Though, to be honest, that sounds like it'd defeat the purpose of this.

For the next step, and easier name matching... why not export a CSV of your online banking and use names and totals to match? Or are these cash receipts?

Good idea! But most of the time I pay in cash when I'm at a supermarket or a gas station, so I don't really have the data in online banking. Paying cash is pretty common in Germany.

Yeah, this is what I ended up doing. I maintain a little project[1] that parses the CSV files and graphs your spending habits as well.

I also wanted fuzzy string matching, but I ended up writing my own clusterer in C to get the speed I wanted, and then wrapped it as a Python module. Must admit I hadn't considered difflib. The C code now lives in CCAN[2]. I've got a patch to add string-vector cosine measurement as a filter, which gives quite a big performance boost. As a rough indicator, on my laptop it clusters 3,500 transaction descriptions in about 600 ms.

[1] https://github.com/amboar/fpos/

[2] http://ccodearchive.net/info/strgrp.html
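This is not the CCAN strgrp code, just a pure-Python sketch of the cosine-filter idea: represent each string as a character-trigram count vector, and only run the expensive edit-distance comparison on pairs whose cosine similarity clears a threshold (the transaction descriptions are made up):

```python
import math
from collections import Counter

def trigrams(s):
    """Character-trigram count vector of a string."""
    s = s.lower()
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def cosine(a, b):
    """Cosine similarity between the trigram vectors of two strings."""
    va, vb = trigrams(a), trigrams(b)
    dot = sum(va[g] * vb[g] for g in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * \
           math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

# Similar descriptions score high, unrelated ones low.
print(cosine("WOOLWORTHS 1234 SYDNEY", "WOOLWORTHS 5678 SYDNEY"))
print(cosine("WOOLWORTHS 1234 SYDNEY", "SHELL FUEL STATION"))
```

The filter is cheap because the trigram vectors can be precomputed once per string, so clustering only pays the quadratic cost on pairs that are plausibly similar.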

I've considered an app that would do this in the past. It would be like mint.com which automatically tracks your finances via online banking, but instead of showing you spent $100 at the supermarket, it would show that you spent $20 on beer, $50 on cash back, and $30 on food... allowing better insights into your finances & where to cut back to save money.

I've been thinking about something vaguely similar for paperwork processing. It'd be nice to pull company name from recognising the layout/logo, and an attempt at reading the date out of the page.

Anyone know any resources or an idea for direction to get started on this?

Quite tricky to get right. Parsing text from a logo is usually not so easy. So here's an alternative approach that could work: Use the first 10% of the page as the "header", no matter what it is. Store all those headers as separate image files. When you scan a new invoice/receipt, try to match it with the list of known headers. OpenCV is good for that.

Look here: http://stackoverflow.com/questions/4196453/simple-and-fast-m... http://stackoverflow.com/questions/11541154/checking-images-...
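For real scans you'd want OpenCV feature matching as in those links, but the header-matching idea itself can be sketched in plain Python: crop the top 10% of each page, and pick the stored header with the smallest pixel difference (the toy "images" below are made-up lists of grayscale rows):

```python
def header_strip(img, frac=0.1):
    """Top `frac` of a grayscale image given as a list of pixel rows."""
    return img[: max(1, int(len(img) * frac))]

def header_distance(img_a, img_b):
    """Mean absolute pixel difference between two header strips."""
    a, b = header_strip(img_a), header_strip(img_b)
    diffs = [abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb)]
    return sum(diffs) / len(diffs)

def best_match(scan, known_headers):
    """Index of the stored header closest to the new scan's header."""
    return min(range(len(known_headers)),
               key=lambda i: header_distance(scan, known_headers[i]))

# Toy 100x40 "scans": one known header is dark, the other light.
dark = [[0] * 40 for _ in range(100)]
light = [[255] * 40 for _ in range(100)]
scan = [[240] * 40 for _ in range(100)]
print(best_match(scan, [dark, light]))  # 1
```

Raw pixel differences assume the scans are already aligned and equally sized; that's exactly the part where OpenCV's template or feature matching earns its keep.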

By the way, thanks for this - this has been a very useful starting point.

If you are genuinely interested in this sort of thing, I'd like to think we do a pretty good job at receipt parsing.


Disclaimer: I work for the company.

Two questions: Do you do line-item extraction?

Are you perhaps going to offer some insight into the author's attempts, what alternative experiences you've had implementing your functionality, or what solutions/algorithms you've used?

I think I would have more problems saving all the receipts using this workflow. Just logging them into YNAB's mobile app is great for me.

Shameless plug: this is exactly why I created http://family-fortune.ridgebit.com/. I log the transaction before the receipt is done printing, then toss the receipt.

