Show HN: RapidFuzz – A fast string matching library for Python (github.com/rhasspy)
134 points by maxbachmann on March 30, 2020 | hide | past | favorite | 27 comments


Ah, I will consider this as a drop-in replacement on EasyALPR.

I use fuzzy text searching to compare LPR license plates with those on lists for parking enforcement.

I wrote a blog entry about Vladimir Iosifovich Levenshtein and a practical description of string comparison here:

https://blog.easyalpr.com/2019/09/16/calculating-license-pla...

Also, here's a UI on top of fuzzy string matching if you just want to play with Levenshtein distance for a few minutes: https://easyalpr.com/tools/string-compare/


Big fan of FuzzyWuzzy, so if it's the same API but with more performance, that sounds great. The performance impact has bitten me before.


Yes, the API and results stay mostly the same. A small difference is that return values are always floats, e.g. a ratio of 94.664 instead of being rounded to 95 as FuzzyWuzzy does, so the results can be compared more precisely.
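To illustrate the rounding difference: FuzzyWuzzy's simple ratio is based on difflib's SequenceMatcher, scaled to 0–100 and rounded. A minimal sketch (not RapidFuzz's actual code, which is C++) of the two behaviors:

```python
# Sketch of the rounded (FuzzyWuzzy-style) vs. float (RapidFuzz-style)
# ratio, using difflib's SequenceMatcher as the underlying similarity.
from difflib import SequenceMatcher

def ratio_rounded(a, b):
    # FuzzyWuzzy-style: scale to 0-100 and round to the nearest integer
    return round(100 * SequenceMatcher(None, a, b).ratio())

def ratio_float(a, b):
    # RapidFuzz-style: keep the full float precision
    return 100 * SequenceMatcher(None, a, b).ratio()

print(ratio_rounded("new york mets", "new york meats"))  # 96
print(ratio_float("new york mets", "new york meats"))    # 96.296...
```

With the float version, two near-ties that both round to 96 can still be ranked against each other.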


For everyone wondering about the actual runtime difference between RapidFuzz and FuzzyWuzzy, I added a couple of first benchmarks based on the benchmarks FuzzyWuzzy uses: https://github.com/rhasspy/rapidfuzz/blob/master/Benchmarks....


>RapidFuzz […] is using the string similarity calculations from FuzzyWuzzy.

>It is MIT licensed so it can be used with whichever license you might want to choose for your project, while you're forced to adopt the GPLv2 license when using FuzzyWuzzy

How exactly is it using FuzzyWuzzy while being MIT licensed?


FuzzyWuzzy was MIT licensed in the beginning, but in 2011 it started using python-Levenshtein, which is GPLv2 licensed, and therefore the whole project became GPLv2 licensed. RapidFuzz is based on the earlier version of FuzzyWuzzy (https://github.com/rhasspy/fuzzywuzzy) and implements similar algorithms in C++.


Good answer. Nice to see licensing issues taken seriously.


This was the main reason to write it. I wrote a small script that searches GitHub for projects using FuzzyWuzzy and sorted them by license into three lists: a) GPL license, b) incompatible license, c) no license. Of these three, only a) is licensed correctly. Out of 1k repositories, only about 100 used the GPL, 300 used an incompatible license, and 600 had no license at all. So I figured there had to be some interest in a more permissively licensed alternative ;)


Fair enough.


Looking at the source, RapidFuzz has reimplemented FuzzyWuzzy's similarity calculations in C++:

https://github.com/rhasspy/rapidfuzz/blob/c225979db3b02caccb...

https://github.com/seatgeek/fuzzywuzzy/blob/2188520502b86375...

I think (IANAL) that that clears it up - not using the same code, but the same algorithm.


>not using the same code, but the same algorithm.

Is it really that clear cut?


Well, it is completely reimplemented in C++, and I actually changed the algorithm in some places, so it still gives the same results with a smaller time complexity. But it is still a derived work, just not of the GPL-licensed version: https://github.com/rhasspy/rapidfuzz/blob/master/LICENSE


Former fuzzywuzzy maintainer here.

We relicensed fuzzywuzzy mostly because I am not a lawyer and really just wanted to stop having the same argument with the same three people about whether what we were doing was legal. I actually think we were violating the license by not including the license header on the included file, but that the rest of the project was fine being MIT. Again, not a lawyer, and the emails from the same ~3 people were becoming quite annoying.

Link here: https://github.com/seatgeek/fuzzywuzzy/issues/130

That said, if their interpretation is right, then unless you clean-roomed the reimplementation, that derivative is also GPL.

Again, not a conversation I personally care for, and I no longer work for SeatGeek so it's no longer really on me :)


That's why I based it on a version from before python-Levenshtein was even added to the project. The Levenshtein part is just the normal Levenshtein stuff that is a standard task at university, I suppose, so it was definitely faster to implement it on my own instead of reading ugly C code. From my tests it is about as fast as the one used by python-Levenshtein (a little bit slower), but at least I can read it ...
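For reference, the "standard university task" mentioned here can be sketched in a few lines of pure Python (this is an illustration, not RapidFuzz's C++ code):

```python
# Textbook Levenshtein distance via dynamic programming over a rolling row:
# O(len(s1) * len(s2)) time, O(len(s2)) memory.
def levenshtein(s1, s2):
    prev = list(range(len(s2) + 1))  # distances from "" to prefixes of s2
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```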


Looks awesome. I've used FuzzyWuzzy with some success in the past, but its performance was abysmal, so I'm interested in giving this a try.

Any chance you could also implement a weighted-Levenshtein version? (Or allow passing in a weight table for various character combinations as a kwarg or something?)

edit: Something like this, but with the speed of C++, would be amazing: https://pypi.org/project/weighted-levenshtein/


In the last release I added a module `rapidfuzz.levenshtein` which allows calculating a normal Levenshtein distance and a weighted version where substitutions have a weight of 2 (this one is actually used by FuzzyWuzzy for ratio calculations). But if there is interest in other weighting options, I can add a version that allows passing in custom costs for insertions, deletions, and substitutions.


I am very interested in a version that allows you to pass in a custom substitution cost! It would be great for OCR applications!


Well then I will add it :)


`rapidfuzz.levenshtein.weighted_distance` now supports the three parameters `insert_cost`, `delete_cost` and `replace_cost`.
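The semantics of those three parameters can be sketched in pure Python (an illustration mirroring the parameter names described above, not RapidFuzz's actual implementation):

```python
# Weighted Levenshtein distance with separate insert/delete/replace costs.
def weighted_distance(s1, s2, insert_cost=1, delete_cost=1, replace_cost=1):
    prev = [j * insert_cost for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        curr = [i * delete_cost]
        for j, c2 in enumerate(s2, start=1):
            sub = 0 if c1 == c2 else replace_cost
            curr.append(min(prev[j] + delete_cost,   # delete c1
                            curr[j - 1] + insert_cost,  # insert c2
                            prev[j - 1] + sub))      # replace (or match)
        prev = curr
    return prev[-1]

# With replace_cost=2 a substitution is never cheaper than delete + insert,
# which is the weighting FuzzyWuzzy's ratio calculation is based on:
print(weighted_distance("ab", "ba", replace_cost=2))  # 2
```

For the OCR use case above, `replace_cost` could be lowered for visually confusable character pairs, though a full confusion-matrix version would need per-pair costs rather than a single scalar.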


Yeah that would be pretty cool. I’d be interested to use that to implement phonetic distance scoring.


It reminds me that every fast Python library is not written in Python!


For many users, the important language is the one you use to interface with the library, not the language the library was written in. :)

Not contradicting your point, btw.


Or, they're written in pure python and run under PyPy.


^ this :)

I once solved a performance problem by using the pure-Python MySQL drivers under PyPy rather than the normal C-wrapped drivers under CPython. The result was actually faster than a Java version of the same program using the normal JDBC drivers.

Something feels deeply wrong with a world where database performance can depend on basic string interpolation performance deep in your client-side MySQL drivers... It's soul-destroying to see naive 'safe' parameterization code coming up hottest in Java profiling results.


I am currently using FuzzyWuzzy in my project for string similarity in a large document collection. It is quite slow, so I will definitely replace FW with your lib.


Awesome work! I’m a big fan of fuzzywuzzy so I’ll be checking this out :) Currently I’m split between using fuzzy matching in my backend Python code and building it into a SQL statement (via SOUNDEX(), which does matching based on phonetic similarity).


this looks neat! trying to think of something to build with this :)



