Show HN: RapidFuzz – A fast string matching library for Python (github.com/rhasspy)
134 points by maxbachmann on March 30, 2020 | hide | past | favorite | 27 comments


Ah, I will consider this as a drop-in replacement on EasyALPR.

I use fuzzy text searching to compare LPR license plates with those on lists for parking enforcement.

I wrote a blog entry about Vladimir Iosifovich Levenshtein and a practical description of string comparison here:

https://blog.easyalpr.com/2019/09/16/calculating-license-pla...

Also, here's a UI on top of fuzzy string matching if you just want to play with Levenshtein distance for a few minutes: https://easyalpr.com/tools/string-compare/


Big fan of FuzzyWuzzy, so if it's the same API but with more performance, that sounds great. The performance impact has bitten me before.


Yes, the API and results stay mostly the same. A small difference is that return values are always floats, e.g. a ratio of 94.664 instead of being rounded to 95 as FuzzyWuzzy does, so the results can be compared more precisely.
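To illustrate the rounding difference: FuzzyWuzzy's simple ratio is based on difflib's SequenceMatcher, scaled to 0–100 and rounded. A minimal sketch (not RapidFuzz's actual code, which is C++) of the two behaviors:

```python
# Sketch of the rounded (FuzzyWuzzy-style) vs. float (RapidFuzz-style)
# ratio, using difflib's SequenceMatcher as the underlying similarity.
from difflib import SequenceMatcher

def ratio_rounded(a, b):
    # FuzzyWuzzy-style: scale to 0-100 and round to the nearest integer
    return round(100 * SequenceMatcher(None, a, b).ratio())

def ratio_float(a, b):
    # RapidFuzz-style: keep the full float precision
    return 100 * SequenceMatcher(None, a, b).ratio()

print(ratio_rounded("new york mets", "new york meats"))  # 96
print(ratio_float("new york mets", "new york meats"))    # 96.296...
```

With the float version, two near-ties that both round to 96 can still be ranked against each other.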


For everyone wondering about the actual runtime difference between RapidFuzz and FuzzyWuzzy, I added a couple of first benchmarks based on the benchmarks FuzzyWuzzy uses: https://github.com/rhasspy/rapidfuzz/blob/master/Benchmarks....


>RapidFuzz […] is using the string similarity calculations from FuzzyWuzzy.

>It is MIT licensed so it can be used with whichever license you might want to choose for your project, while you're forced to adopt the GPLv2 license when using FuzzyWuzzy

How exactly is it using FuzzyWuzzy while being MIT licensed?


FuzzyWuzzy was MIT licensed in the beginning, but in 2011 it started using python-Levenshtein, which is GPLv2 licensed, and therefore the whole project became GPLv2 licensed. RapidFuzz is based on the earlier version of FuzzyWuzzy (https://github.com/rhasspy/fuzzywuzzy) and implements similar algorithms in C++.


Good answer. Nice to see licensing issues taken seriously.


This was the main reason to write it. I wrote a small script that searches GitHub for projects using FuzzyWuzzy and sorted them by license into three lists: a) GPL license, b) incompatible license, c) no license. Of these three, only a) is licensed correctly. Out of 1k repositories, only about 100 used the GPL, 300 used an incompatible license, and 600 had no license at all. So I figured there had to be some interest in a more permissively licensed alternative ;)


Fair enough.


Looking at the source, RapidFuzz has reimplemented FuzzyWuzzy's similarity calculations in C++:

https://github.com/rhasspy/rapidfuzz/blob/c225979db3b02caccb...

https://github.com/seatgeek/fuzzywuzzy/blob/2188520502b86375...

I think (IANAL) that that clears it up - not using the same code, but the same algorithm.


>not using the same code, but the same algorithm.

Is it really that clear cut?


Well, it is completely reimplemented in C++, and I actually changed the algorithm in some places, so it still gives the same results with a smaller time complexity. But it is still a derived work, just not of the GPL-licensed version: https://github.com/rhasspy/rapidfuzz/blob/master/LICENSE


Former fuzzywuzzy maintainer here.

We relicensed fuzzywuzzy mostly because I am not a lawyer and really just wanted to stop having the same argument with the same three people about whether what we were doing was legal. I actually think we were violating the license by not including the license header on the included file, but that the rest of the project was fine being MIT. Again, not a lawyer, and the emails from the same ~3 people were becoming quite annoying.

Link here: https://github.com/seatgeek/fuzzywuzzy/issues/130

That said, if their interpretation is right, then unless you clean-roomed the reimplementation, that derivative is also GPL.

Again, not a conversation I personally care for, and I no longer work for SeatGeek so it's no longer really on me :)


That's why I based it on a version from before python-Levenshtein was even added to the project. The Levenshtein part is just the normal Levenshtein stuff that is a standard task at university, I suppose, so it was definitely faster to implement it on my own instead of reading ugly C code. From my tests it is about as fast as the one used by python-Levenshtein (a little bit slower), but at least I can read it ...
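For reference, the "standard university task" mentioned here can be sketched in a few lines of pure Python (this is an illustration, not RapidFuzz's C++ code):

```python
# Textbook Levenshtein distance via dynamic programming over a rolling row:
# O(len(s1) * len(s2)) time, O(len(s2)) memory.
def levenshtein(s1, s2):
    prev = list(range(len(s2) + 1))  # distances from "" to prefixes of s2
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```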


Looks awesome. I've used FuzzyWuzzy with some success in the past, but its performance was abysmal, so I'm interested in giving this a try.

Any chance you could also implement a weighted-Levenshtein version? (Or allow passing in a weight table for various character combinations as a kwarg or something?)

edit: Something like this, but with the speed of C++, would be amazing: https://pypi.org/project/weighted-levenshtein/


In the last release I added a module `rapidfuzz.levenshtein` which allows calculating a normal Levenshtein distance and a weighted version where substitutions have a weight of 2 (this one is actually used by FuzzyWuzzy for ratio calculations). But if there is interest in other weighting options, I can add a version that allows passing in custom costs for insertions, deletions, and substitutions.


I am very interested in a version that allows you to pass in a custom substitution cost! It would be great for OCR applications!


Well then I will add it :)


`rapidfuzz.levenshtein.weighted_distance` now supports the three parameters `insert_cost`, `delete_cost` and `replace_cost`.
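The semantics of those three parameters can be sketched in pure Python (an illustration mirroring the parameter names described above, not RapidFuzz's actual implementation):

```python
# Weighted Levenshtein distance with separate insert/delete/replace costs.
def weighted_distance(s1, s2, insert_cost=1, delete_cost=1, replace_cost=1):
    prev = [j * insert_cost for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        curr = [i * delete_cost]
        for j, c2 in enumerate(s2, start=1):
            sub = 0 if c1 == c2 else replace_cost
            curr.append(min(prev[j] + delete_cost,   # delete c1
                            curr[j - 1] + insert_cost,  # insert c2
                            prev[j - 1] + sub))      # replace (or match)
        prev = curr
    return prev[-1]

# With replace_cost=2 a substitution is never cheaper than delete + insert,
# which is the weighting FuzzyWuzzy's ratio calculation is based on:
print(weighted_distance("ab", "ba", replace_cost=2))  # 2
```

For the OCR use case above, `replace_cost` could be lowered for visually confusable character pairs, though a full confusion-matrix version would need per-pair costs rather than a single scalar.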


Yeah that would be pretty cool. I’d be interested to use that to implement phonetic distance scoring.


It reminds me that every fast Python library is not written in Python!


For many users, the important language is the one you use to interface with the library, not the language the library was written in. :)

Not contradicting your point, btw.


Or, they're written in pure python and run under PyPy.


^ this :)

I once solved a performance problem by using the pure-Python MySQL drivers under PyPy rather than the normal C-wrapped drivers under CPython. The result was actually faster than a Java version of the same program using the normal JDBC drivers.

Something feels deeply wrong with a world where database performance can depend on basic string interpolation performance deep in your client-side MySQL drivers... It's soul-destroying to see naive 'safe' parameterization code coming up hottest in Java profiling results.


I am currently using FuzzyWuzzy in my project for string similarity in a large document collection. It is quite slow, so I will definitely replace FW with your lib.


Awesome work! I’m a big fan of fuzzywuzzy so I’ll be checking this out :) Currently I’m split between using fuzzy matching in my backend Python code and building it into a SQL statement (via SOUNDEX(), which does matching based on phonetic similarity).


this looks neat! trying to think of something to build with this :)



