
Show HN: RapidFuzz – A fast string matching library for Python - maxbachmann
https://github.com/rhasspy/rapidfuzz
======
bredren
Ah, I will consider this as a drop-in replacement on EasyALPR.

I use fuzzy text searching to compare LPR license plates with those on lists
for parking enforcement.
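
For anyone curious what that kind of lookup can look like, here is a minimal sketch using only Python's standard-library `difflib`. It is not how EasyALPR does it: `difflib` scores similarity from matching blocks rather than Levenshtein distance, and the helper name `match_plate`, the sample plates, and the 0.8 cutoff are all made up for illustration.

```python
from difflib import get_close_matches


def match_plate(read, watchlist, cutoff=0.8):
    """Return the watchlist plate most similar to an OCR read, or None.

    Hypothetical helper: the default cutoff of 0.8 is an arbitrary assumption.
    """
    candidates = get_close_matches(read.upper(), watchlist, n=1, cutoff=cutoff)
    return candidates[0] if candidates else None


watchlist = ["ABC1234", "XYZ9999"]          # made-up plates
print(match_plate("ABC1Z34", watchlist))    # read with one wrong character
print(match_plate("QQQ0000", watchlist))    # nothing similar on the list
```

A stricter cutoff reduces false matches at the price of missing noisier reads, so the value would need tuning against real camera data.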

I wrote a blog entry about Vladimir Iosifovich Levenshtein and a practical
description of string comparison here:

[https://blog.easyalpr.com/2019/09/16/calculating-license-
pla...](https://blog.easyalpr.com/2019/09/16/calculating-license-plate-
similarity-using-levenshtein-distance/)

Also, here's a UI on top of fuzzy string matching if you just want to play
with Levenshtein distance for a few minutes:
[https://easyalpr.com/tools/string-
compare/](https://easyalpr.com/tools/string-compare/)

------
meowface
Big fan of FuzzyWuzzy, so if it's the same API but with more performance,
sounds great. The performance impact has bitten me before.

~~~
maxbachmann
Yes, the API and results stay mostly the same. A small difference is that
return values are always floats, e.g. a ratio of 94.664 instead of being
rounded to 95 as FuzzyWuzzy does, so results can be compared more precisely.

------
Hamuko
> _RapidFuzz […] is using the string similarity calculations from FuzzyWuzzy._

> _It is MIT licensed so it can be used whichever License you might want to
> choose for your project, while you're forced to adopt the GPLv2 license when
> using FuzzyWuzzy_

How exactly is it using FuzzyWuzzy while still being MIT licensed?

~~~
maxbachmann
FuzzyWuzzy was MIT licensed in the beginning, but in 2011 it started using
python-Levenshtein, which is GPLv2 licensed, so the whole project became
GPLv2 licensed. RapidFuzz is based on this version of FuzzyWuzzy:
[https://github.com/rhasspy/fuzzywuzzy](https://github.com/rhasspy/fuzzywuzzy)
and implements similar algorithms in C++.

~~~
edraferi
Good answer. Nice to see licensing issues taken seriously.

~~~
maxbachmann
This was the main reason to write it. I wrote a small script that searches
GitHub for projects using FuzzyWuzzy and sorts them by license into three
lists: a) GPL license, b) incompatible license, c) no license. Of these
three, only a) is licensed correctly. Out of 1k repositories, only about 100
used the GPL, 300 an incompatible license and 600 no license at all. So I
figured there has to be some interest in having a more permissively licensed
alternative ;)

------
germanjoey
Looks awesome. I've used FuzzyWuzzy with some success in the past but its
performance was abysmal, so I'm interested in giving this a try.

Any chance you could also implement a weighted-Levenshtein version? (Or allow
passing in a weight table for various character combinations as a kwarg or
something?)

edit: Something like this, but with the speed of C++ would be amazing.
[https://pypi.org/project/weighted-
levenshtein/](https://pypi.org/project/weighted-levenshtein/)

~~~
maxbachmann
In the last release I added a module `rapidfuzz.levenshtein` which allows
calculating a normal Levenshtein distance and a weighted version where
substitutions have a weight of 2 (this one is actually used by FuzzyWuzzy for
ratio calculations). If there is interest in other weighting options, I can
add a version that lets you pass in custom costs for insertions, deletions,
and substitutions.
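
To illustrate what the two variants compute, here is a pure-Python sketch. The real module is C++, and the function and parameter names below are invented for this example; they are not the actual `rapidfuzz.levenshtein` API. With a substitution cost of 2, a substitution costs the same as a deletion plus an insertion. Normalizing by the combined length of both strings to get a 0-100 ratio is an assumption about the usual convention.

```python
def weighted_levenshtein(s1, s2, insert_cost=1, delete_cost=1, substitute_cost=1):
    """Edit distance with one uniform cost per operation type (hypothetical helper)."""
    # previous[j] = cost of turning the processed prefix of s1 into s2[:j]
    previous = [j * insert_cost for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        current = [i * delete_cost]
        for j, c2 in enumerate(s2, start=1):
            diagonal = previous[j - 1] + (0 if c1 == c2 else substitute_cost)
            current.append(min(previous[j] + delete_cost,     # delete c1
                               current[j - 1] + insert_cost,  # insert c2
                               diagonal))                     # match or substitute
        previous = current
    return previous[-1]


def normalized_ratio(s1, s2):
    """Similarity in [0, 100] as a float, using substitution cost 2 (assumed formula)."""
    length_sum = len(s1) + len(s2)
    if length_sum == 0:
        return 100.0
    distance = weighted_levenshtein(s1, s2, substitute_cost=2)
    return 100.0 * (length_sum - distance) / length_sum


print(weighted_levenshtein("kitten", "sitting"))  # plain distance
print(normalized_ratio("kitten", "sitting"))      # unrounded float, roughly 61.54
```

Keeping the float avoids ties that appear only because of rounding, for example two candidates at 61.54 and 62.4 that would both be reported as 62.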

~~~
germanjoey
I am very interested in a version that allows you to pass in a custom
substitution cost! It has a great use for OCR applications!
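
As a rough illustration of why that helps for OCR, here is a pure-Python sketch in which visually confusable pairs such as `O`/`0` are cheap to substitute. Everything in it is hypothetical: the function name `ocr_distance`, the cost table, and the cost values are made up, and this is not an existing RapidFuzz API.

```python
def ocr_distance(s1, s2, substitution_costs, default_cost=1.0):
    """Levenshtein distance with per-pair substitution costs (hypothetical helper).

    Insertions and deletions cost 1.0; pairs missing from the table cost default_cost.
    """
    def sub_cost(a, b):
        if a == b:
            return 0.0
        # Look the pair up in either order so the table only lists it once.
        return substitution_costs.get((a, b),
                                      substitution_costs.get((b, a), default_cost))

    previous = [float(j) for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        current = [float(i)]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(previous[j] + 1.0,                    # delete c1
                               current[j - 1] + 1.0,                 # insert c2
                               previous[j - 1] + sub_cost(c1, c2)))  # match/substitute
        previous = current
    return previous[-1]


# Made-up costs for characters an OCR engine might confuse.
CONFUSABLE = {("O", "0"): 0.2, ("I", "1"): 0.2, ("B", "8"): 0.3}

print(ocr_distance("B0B", "BOB", CONFUSABLE))  # cheap: 0 and O look alike
print(ocr_distance("BXB", "BOB", CONFUSABLE))  # full cost: X and O do not
```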

~~~
maxbachmann
Well then I will add it :)

------
h8hawk
It reminds me that every fast Python library is not written in Python!

~~~
j88439h84
Or, they're written in pure Python and run under PyPy.

~~~
willvarfar
^ this :)

I once solved a performance problem by using the pure-Python MySQL drivers and
PyPy, rather than the normal C-wrapped drivers and Python. The result was
actually faster than a Java version of the same program using the normal JDBC
drivers.

Something feels deeply wrong with a world where database performance can
depend on basic string interpolation performance deep in your client-side
MySQL drivers... It's soul-destroying to see naive 'safe' parameterization
code coming up hottest in Java profiling results.

------
gillesjacobs
I am currently using FuzzyWuzzy in my project for string similarity in a large
document collection. It is quite slow, so I will definitely replace FW with
your lib.

------
maxbachmann
For everyone wondering about the actual runtime difference between RapidFuzz
and FuzzyWuzzy, I added a couple of initial benchmarks based on the benchmarks
FuzzyWuzzy uses:
[https://github.com/rhasspy/rapidfuzz/blob/master/Benchmarks....](https://github.com/rhasspy/rapidfuzz/blob/master/Benchmarks.md)

------
mslip
Awesome work! I’m a big fan of fuzzywuzzy so I’ll be checking this out :)
Currently I’m split between doing fuzzy matching in my backend Python code
and building it into a SQL statement (via SOUNDEX(), which does matching
based on phonetic similarity).
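
For comparison with edit-distance matching, here is a minimal Python sketch of the classic American Soundex code. It is only an illustration of phonetic keys: SQL engines differ in details (some may return codes longer than four characters), so this should not be read as a reimplementation of any particular `SOUNDEX()`. The function name `soundex` is just a label for this example.

```python
# Consonant groups of classic American Soundex; vowels, Y, H and W get no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return the four-character Soundex key of a word ('' if it has no letters)."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    key = [letters[0]]
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W do not separate two consonants with the same code
        code = CODES.get(letter, "")
        if code and code != previous:
            key.append(code)
            if len(key) == 4:
                break
        previous = code  # a vowel resets this, so a repeated code counts again
    return "".join(key).ljust(4, "0")


print(soundex("Robert"), soundex("Rupert"))  # same key: the names sound alike
```

Phonetic keys group names that sound alike even when the spelling differs a lot, while an edit-distance ratio is better at catching typos, so the two approaches answer slightly different questions.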

------
holler
this looks neat! trying to think of something to build with this :)

