
Writing a fuzzy receipt parser in Python - andygrunwald
http://tech.trivago.com/2015/10/06/python_receipt_parser/
======
bariumbitmap
It's a shame that receipts don't have machine readable output.

QR codes can hold a little over 1,200 characters, which should be more than
enough for most receipts.

[https://en.wikipedia.org/wiki/QR_code#Storage](https://en.wikipedia.org/wiki/QR_code#Storage)

Edit: related link: [https://www.quora.com/Can-and-how-cash-register-receipts-be-...](https://www.quora.com/Can-and-how-cash-register-receipts-be-transformed-to-a-QR-code-and-scanned-with-a-smartphone-app-so-it-can-become-digital?share=1)
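
To get a feel for the size argument, here is a rough sketch that serializes a small receipt as compact JSON and checks it against the roughly 1,200-character figure from the comment. The field names and the `receipt_payload` helper are made up for illustration; no real receipt standard is implied.

```python
import json

# Assumption: ~1,200 characters as the budget, the figure quoted in the comment.
QR_BUDGET = 1200

def receipt_payload(shop, date, items):
    """Serialize a receipt compactly; prices are integer cents to avoid float noise."""
    payload = {
        "shop": shop,
        "date": date,
        "items": [[name, cents] for name, cents in items],
        "total": sum(cents for _, cents in items),
    }
    # Compact separators drop the whitespace json.dumps adds by default.
    return json.dumps(payload, separators=(",", ":"), ensure_ascii=False)

encoded = receipt_payload("Market name", "2015-10-06", [("Milch", 109), ("Brot", 249)])
print(len(encoded), len(encoded) <= QR_BUDGET)
```

A long shopping trip would need more room per item, so the budget check matters more than this two-item example suggests.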

~~~
tonyedgecombe
I'd be happy if they made the store name, date and total more prominent, it's
amazing how often that is hard to find despite it being the most important
information to the customer.

~~~
omn1
Exactly. Even the term for the total changes with every receipt. I discovered
every term you can imagine, like "sum", "total amount", "overall", ... (in
German, obviously)
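
One way to cope with the changing terms is fuzzy matching against a keyword list, for example with `difflib.get_close_matches` from the standard library. This is only a sketch: the keyword list and the `find_total_line` helper are assumptions, not the author's actual code.

```python
from difflib import get_close_matches

# Hypothetical keyword list; a real one would hold the terms seen on actual receipts.
TOTAL_KEYWORDS = ["summe", "gesamt", "gesamtbetrag", "total", "betrag"]

def find_total_line(lines, cutoff=0.75):
    """Return the first line containing a word close to a known total keyword."""
    for line in lines:
        for word in line.lower().split():
            # Tolerates OCR slips such as "SUMNE" for "SUMME".
            if get_close_matches(word, TOTAL_KEYWORDS, n=1, cutoff=cutoff):
                return line
    return None

ocr_lines = ["REWE Markt", "Milch 1,09", "SUMNE EUR 12,34", "Bar 20,00"]
print(find_total_line(ocr_lines))
```

Lowering `cutoff` catches more OCR errors but also raises the risk of false matches on ordinary product names.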

------
laito
Hey, this is pretty cool. I actually tried something similar (keeping a list
of shop names and matching it against Tesseract's results). I was trying a
Hough transform to correct slight image rotations. I wasn't aware of
ImageMagick's textcleaner script. That could have saved me a lot of trouble :)
I got roadblocked by the problem of having various kinds of receipts with
absolutely no layout in common. I figured the system would need a lot of
training to reach decent accuracy and left it for another day.

~~~
omn1
Yeah, relying on the layout never worked for me either. For instance, even the
supermarket address is not standardized. Some shops use:

> Market name
> Examplestreet 12
> 19393 Examplecity

while others use:

> Market name
> 19393 Examplecity
> Examplestreet 12

We're not even talking about the invoice/receipt layout. It's different all
the time.
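
A possible workaround for the address case is to classify each line by its shape rather than its position. The sketch below assumes German-style addresses (five-digit postal code, street line ending in a house number); the regexes and the `parse_address` helper are illustrative guesses, not something from the article.

```python
import re

# Assumption: a five-digit postal code followed by the city name.
POSTAL_RE = re.compile(r"^(\d{5})\s+(.+)$")
# Assumption: a street line ends in a house number, optionally with a letter.
STREET_RE = re.compile(r"^(.+?)\s+(\d+\s*[a-zA-Z]?)$")

def parse_address(lines):
    """Classify each line by what it looks like, ignoring the order."""
    result = {"name": None, "street": None, "postal_code": None, "city": None}
    for line in (raw.strip() for raw in lines):
        postal = POSTAL_RE.match(line)
        if postal:
            result["postal_code"], result["city"] = postal.groups()
        elif STREET_RE.match(line) and result["street"] is None:
            result["street"] = line
        elif result["name"] is None:
            result["name"] = line
    return result

first = parse_address(["Market name", "Examplestreet 12", "19393 Examplecity"])
second = parse_address(["Market name", "19393 Examplecity", "Examplestreet 12"])
print(first == second)
```

Both orderings from the comment end up in the same dictionary, which is the point of matching on shape.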

~~~
mseri
Cool. We did a similar thing at a hack, integrated with Dropbox and automatic
monthly receipt generation. I think most if not all of the code should still
be in pieces on our GitHub accounts.

------
omn1
Hey, author here. I'm happy to answer questions and welcome any kind of feedback.

~~~
whosbein
This is really cool, nice work! I tinkered with the same thing a while ago.
Sadly, that project has been neglected. One thing I was most proud of was
developing a way to automatically rotate the receipt image into the best
position for the OCR. I noticed decent improvements even when the correction
was only a few degrees.

Question: have you thought about any better methods for getting the receipts
scanned? A big hurdle for me was the time it took to scan everything. I know
there are scanners designed for such a thing, and you seem to have used one,
but were you satisfied with it?

And thanks for the get_close_matches() hint! I'll try and remember that if I
ever start hacking again on this.

~~~
omn1
Yeah, scanning the receipts takes the longest time in the whole process. So you
should make sure that you have a fast and simple scanner. I recommend buying a
simple photo scanner like I did. It's just a couple of bucks and it will save
you so much time. And the quality was near perfect. Even very old receipts
could be parsed after scanning with it.

When it comes to rotation, you can try
[http://www.fmwconcepts.com/imagemagick/textcleaner/](http://www.fmwconcepts.com/imagemagick/textcleaner/).
It will rotate your scan if it's not perfectly aligned.

~~~
whosbein
Ha, cool, didn't know about textcleaner either. Always fun when you write some
huge routine to do something that you later find out is a simple command
elsewhere. Thanks!

~~~
zo1
Just a quick FYI. I looked at "Fred's Scripts" a while back for a commercial
project I was working on. Have a look at the license if you ever plan on doing
any commercial work with those scripts.

------
pbnjay
For the next step, and easier name matching... why not export a CSV of your
online banking and use names and totals to match? Or are these cash receipts?
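
A rough sketch of that idea: filter the bank export by amount, then pick the row whose payee name is most similar to the OCR'd shop name. The column names (`date`, `payee`, `amount`), the sample rows, and the `match_receipt` helper are all hypothetical; real bank exports differ.

```python
import csv
import io
from difflib import SequenceMatcher

def match_receipt(receipt, bank_rows, min_similarity=0.6):
    """Return the bank row with the same total and the most similar payee name."""
    best_row, best_score = None, min_similarity
    for row in bank_rows:
        # Totals must agree to the cent before names are compared at all.
        if abs(float(row["amount"]) - receipt["total"]) > 0.005:
            continue
        score = SequenceMatcher(
            None, receipt["shop"].lower(), row["payee"].lower()
        ).ratio()
        if score >= best_score:
            best_row, best_score = row, score
    return best_row

# Hypothetical export; real bank CSVs use their own column names and formats.
export = io.StringIO(
    "date,payee,amount\n"
    "2015-10-01,REWE MARKT GMBH,12.34\n"
    "2015-10-02,ARAL TANKSTELLE,12.34\n"
    "2015-10-03,REWE MARKT GMBH,45.00\n"
)
rows = list(csv.DictReader(export))
receipt = {"shop": "REVE Markt", "total": 12.34}  # shop name with an OCR slip
print(match_receipt(receipt, rows))
```

Matching on the total first keeps the fuzzy name comparison from having to do all the work.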

~~~
omn1
Good idea! Most of the time I pay in cash when I'm at a supermarket or at a
gas station. So I don't really have the data in online banking. It's kind of
common in Germany.

------
joshribakoff
I've considered an app that would do this in the past. It would be like
mint.com which automatically tracks your finances via online banking, but
instead of showing you spent $100 at the supermarket, it would show that you
spent $20 on beer, $50 on cash back, and $30 on food... allowing better
insights into your finances & where to cut back to save money.

------
misnome
I've been thinking about something vaguely similar for paperwork processing.
It'd be nice to pull the company name by recognising the layout/logo, and to
attempt reading the date off the page.

Does anyone know of any resources, or have an idea of where to get started on this?
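
For the date part, a regex over the OCR text plus a validity check is one possible starting point. The sketch assumes dates written as DD.MM.YYYY or DD.MM.YY; the `find_date` helper and the sample text are made up for illustration.

```python
import re
from datetime import datetime

# Assumption: dates appear as DD.MM.YYYY or DD.MM.YY.
DATE_RE = re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4}|\d{2})\b")

def find_date(text):
    """Return the first valid date found in OCR text, or None."""
    for match in DATE_RE.finditer(text):
        day, month, year = match.groups()
        fmt = "%d.%m.%Y" if len(year) == 4 else "%d.%m.%y"
        try:
            return datetime.strptime(f"{day}.{month}.{year}", fmt).date()
        except ValueError:
            continue  # looks like a date but is not one, e.g. a misread digit
    return None

print(find_date("Rechnung Nr. 4711\nDatum: 06.10.2015 14:32"))
```

Letting `strptime` reject impossible dates filters out some OCR garbage for free.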

~~~
omn1
Quite tricky to get right. Parsing text from a logo is usually not so easy. So
here's an alternative approach that could work: Use the first 10% of the page
as the "header", no matter what it is. Store all those headers as separate
image files. When you scan a new invoice/receipt, try to match it with the
list of known headers. OpenCV is good for that.

Look here: [http://stackoverflow.com/questions/4196453/simple-and-fast-m...](http://stackoverflow.com/questions/4196453/simple-and-fast-method-to-compare-images-for-similarity)
[http://stackoverflow.com/questions/11541154/checking-images-...](http://stackoverflow.com/questions/11541154/checking-images-for-similarity-with-opencv)
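
The header idea can be sketched without any imaging library. The code below is a deliberately simplified stand-in for OpenCV: images are plain lists of grayscale rows, all the same size, and the comparison is a mean absolute pixel difference. The helper names and the threshold are assumptions; real scans would need resizing and a more robust comparison.

```python
def header(image, fraction=0.1):
    """Return the top `fraction` of rows of a grayscale image (list of rows)."""
    rows = max(1, int(len(image) * fraction))
    return image[:rows]

def difference(a, b):
    """Mean absolute pixel difference between two equally sized headers."""
    pixels = [(x, y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(abs(x - y) for x, y in pixels) / len(pixels)

def identify(image, known_headers, max_difference=20):
    """Return the name of the closest known header, or None if nothing is close."""
    candidate = header(image)
    name, stored = min(
        known_headers.items(), key=lambda kv: difference(candidate, kv[1])
    )
    return name if difference(candidate, stored) <= max_difference else None

def make_page(top_row, height=10):
    """Build a tiny synthetic page: one 'logo' row followed by white rows."""
    return [top_row] + [[255] * len(top_row) for _ in range(height - 1)]

known = {
    "shop_a": header(make_page([0, 255, 0, 255])),
    "shop_b": header(make_page([255, 255, 0, 0])),
}
noisy_scan = make_page([10, 250, 5, 240])  # shop_a with some scanner noise
print(identify(noisy_scan, known))
print(identify(make_page([128, 128, 128, 128]), known))
```

The threshold decides when a scan counts as "unknown", so it would have to be tuned on real headers.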

~~~
misnome
By the way, thanks for this - this has been a very useful starting point.

------
t_g
If you are genuinely interested in this sort of thing, I'd like to think we do
a pretty good job at receipt parsing.

[http://www.neat.com/](http://www.neat.com/)

Disclaimer: I work for the company.

~~~
zo1
Two questions: Do you do line-item extraction?

Are you perhaps going to offer some insight into the author's attempts, the
alternative experiences you've had implementing your functionality, or the
solutions/algorithms you've used?

------
comrh
I think I would have more problems saving all the receipts using this
workflow. Just logging them into YNAB's mobile app is great for me.

~~~
IgorPartola
Shameless plug: this is exactly why I created
[http://family-fortune.ridgebit.com/](http://family-fortune.ridgebit.com/). I
log the transaction before the receipt is done printing, then toss the receipt.

