Yes, the API and results stay mostly the same.
A small difference is that return values are always floats, e.g. a ratio of 94.664 instead of being rounded to 95 as FuzzyWuzzy does, so the results can be compared more precisely.
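To see what that rounding throws away, here is a stdlib-only sketch (using `difflib.SequenceMatcher`, which computes essentially the same similarity ratio as FuzzyWuzzy's simple ratio, not RapidFuzz itself):

```python
from difflib import SequenceMatcher

a, b = "this is a test", "this is a test!"
score = SequenceMatcher(None, a, b).ratio() * 100

print(round(score))  # FuzzyWuzzy-style integer score: 97
print(score)         # unrounded float score: ~96.55
```

Two pairs that round to the same integer can still be distinguished by the float score.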
>RapidFuzz […] is using the string similarity calculations from FuzzyWuzzy.
>It is MIT licensed so it can be used with whichever license you might want to choose for your project, while you're forced to adopt the GPLv2 license when using FuzzyWuzzy
To be exact, how is it using FuzzyWuzzy while remaining MIT licensed?
FuzzyWuzzy was MIT licensed in the beginning, but in 2011 the project started using python-Levenshtein, which is GPLv2 licensed, and therefore the whole project became GPLv2 licensed.
RapidFuzz is based on this version of FuzzyWuzzy: https://github.com/rhasspy/fuzzywuzzy and implements similar algorithms in C++.
This was the main reason to write it. I wrote a small script that searches GitHub for projects using FuzzyWuzzy and sorted them by license into three lists:
a) GPL license, b) incompatible license, c) no license.
Of these three, only a) is licensed correctly. Out of 1k repositories, only about 100 used the GPL, 300 an incompatible license, and 600 no license at all. So I figured there has to be some interest in a more permissively licensed alternative ;)
Well, it is completely reimplemented in C++, and I actually changed the algorithm in some places so it still gives the same results with a smaller time complexity. But it is still a derived work, just not of the GPL-licensed version: https://github.com/rhasspy/rapidfuzz/blob/master/LICENSE
We relicensed FuzzyWuzzy mostly because I am not a lawyer and really just wanted to stop having the same argument with the same three people about whether what we were doing was legal. I actually think we were violating the license by not including the license header on the included file, but that the rest of the project was fine as MIT. Then again, not a lawyer, and the emails from the same ~3 people were becoming quite annoying.
That's why I based it on a version from before python-Levenshtein was even added to the project. The Levenshtein part is just the normal Levenshtein algorithm, which is a standard task at university I suppose, so it was definitely faster to implement it on my own instead of reading ugly C code. From my tests it is about as fast as the one used by python-Levenshtein (a little bit slower), but in exchange I can read it ...
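For reference, the textbook dynamic-programming algorithm being described looks roughly like this (a plain Python sketch, not the actual C++ code from RapidFuzz):

```python
def levenshtein(s: str, t: str) -> int:
    """Classic edit distance, two-row dynamic-programming variant."""
    prev = list(range(len(t) + 1))  # distances from "" to prefixes of t
    for i, sc in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to ""
        for j, tc in enumerate(t, start=1):
            cost = 0 if sc == tc else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution / match
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```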
Looks awesome. I've used FuzzyWuzzy with some success in the past, but its performance was abysmal, so I'm interested in giving this a try.
Any chance you could also implement a weighted-Levenshtein version? (or allow passing in a weight table for various character combinations, as a kwarg or something?)
In the last release I added a module `rapidfuzz.levenshtein` that allows calculating a normal Levenshtein distance and a weighted version where substitutions have a weight of 2 (the latter is actually what FuzzyWuzzy uses for ratio calculations).
But if there is interest in other weighting options, I can add a version that allows passing in custom costs for insertions, deletions, and substitutions.
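Custom costs are only a small change to the standard recurrence. A hypothetical sketch (the parameter names here are illustrative, not RapidFuzz's actual API):

```python
def weighted_levenshtein(s, t, insert_cost=1, delete_cost=1, substitute_cost=1):
    """Edit distance where each operation carries its own cost."""
    prev = [j * insert_cost for j in range(len(t) + 1)]
    for i, sc in enumerate(s, start=1):
        curr = [i * delete_cost]
        for j, tc in enumerate(t, start=1):
            sub = 0 if sc == tc else substitute_cost
            curr.append(min(prev[j] + delete_cost,
                            curr[j - 1] + insert_cost,
                            prev[j - 1] + sub))
        prev = curr
    return prev[-1]

# With substitutions weighted 2, "kitten" -> "sitting" costs
# 2 substitutions * 2 + 1 insertion = 5
print(weighted_levenshtein("kitten", "sitting", substitute_cost=2))  # 5
```

With substitution weight 2, a similarity ratio in the FuzzyWuzzy style can be derived as `(len(s) + len(t) - dist) / (len(s) + len(t))`.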
I once solved a performance problem by using the pure-Python MySQL drivers and PyPy, rather than the normal C-wrapped drivers and CPython. The result was actually faster than a Java version of the same program using the normal JDBC drivers.
Something feels deeply wrong with a world where database performance can depend on basic string interpolation performance deep in your client-side MySQL drivers... It's soul-destroying to see naive 'safe' parameterization code coming up hottest in Java profiling results.
I am currently using FuzzyWuzzy in my project for string similarity in a large document collection. It is quite slow, so I will definitely replace FW with your lib.
Awesome work! I'm a big fan of FuzzyWuzzy, so I'll be checking this out :) Currently I'm split between using fuzzy logic in my backend Python code and having it built into a SQL statement (via SOUNDEX(), which does matching based on phonetic similarity).
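For anyone curious what SOUNDEX() does under the hood: the classic American Soundex algorithm fits in a few lines of Python (a stdlib sketch; MySQL's implementation differs slightly, e.g. in output length):

```python
def soundex(name: str) -> str:
    """American Soundex: first letter plus three digits encoding consonants."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    out = name[0].upper()
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":                # h/w do not separate duplicate codes
            continue
        code = codes.get(ch, "")      # vowels map to "" and reset prev
        if code and code != prev:
            out += code
        prev = code
    return (out + "000")[:4]          # pad with zeros to four characters

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Phonetically similar names collapse to the same code, which is what makes it usable directly in a SQL WHERE clause.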
I use fuzzy text searching to compare LPR license plates with those on lists for parking enforcement.
I wrote a blog entry about Vladimir Iosifovich Levenshtein and a practical description of string comparison here:
https://blog.easyalpr.com/2019/09/16/calculating-license-pla...
Also, here's a UI on top of fuzzy string matching if you just want to play with levenshtein distance for a few minutes: https://easyalpr.com/tools/string-compare/