Hacker News new | past | comments | ask | show | jobs | submit login
Show HN: uFuzzy.js – A tiny, efficient fuzzy search that doesn't suck (github.com/leeoniya)
251 points by leeoniya on Sept 30, 2022 | hide | past | favorite | 82 comments
Hello HN!

I became frustrated with the unpredictible/poor match quality and opaqueness of "relevance scores" in existing fuzzy and fulltext search libs, so I tried something different and this is the result. The main selling point is the result quality / ordering, with best-in-class memory overhead and excellent performance being bonuses. The API is pretty stable at this point, but looking for feedback before committing to 1.0.

TL;DR

The test corpus is a 4MB json file with 162k words/phrases, so give it a second for initial download. You can also drag/drop your own text/json corpus into the UI to try it against your own dataset.

Live demo/compare with a few other libs (there are many more in the codebase, in various states of completion, WIP):

https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uF...

In isolation for perf assessment:

https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uF...

To increase fuzziness and get broader results, try setting intraMax=1 (core) and enable outOfOrder (userland):

https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uF...

Also play with the sortPreset selector to swap out the default Array.sort() for one in userland that prioritizes typehead-ness (the resultset remains identical).

Still TODO:

  - Example of stripping diacritics
  - Example of using non-latin charsets
  - Example of prefix-caching to improve typeahead perf even further
  - Example of poor man's document search (matching multiple object properties)
That's all, thanks!



Thank you for this!

I am also quite frustrated with the current state of full text search in the javascript world. All libs I've tried miss the most basic examples and their community seems to ignore it. Will give yours a try but it already looks much better from the comparison page.

Edit: Nope, your lib doesn't seem to handle substitution well (THE most common type of typo), so yep, we are back to square one ...


yep. the core is regexp based and there is no string distance assertion during initial pre-filtering, so this wont be a use case uFuzzy can accomodate.

the intro does caveat that it would make for a poor spellcheck :)

FlexSearch actually does pretty well and can work for you, though can get quite memory hungry depending on your tokenization settings. try other libs in my compare demo, too. there are a lot of options!


your comparisons table at the bottom is worth its weight in gold. ignore the haters, thank you OP


> ignore the haters

i'm an OSS dev; how thin do you think my skin is? :D


I wouldnt say the parent commenter is a hater.

But thats a good general sentiment yeah


I have spent many days searching for the best JavaScript implementation of full text search that handles typos (substitutions) well. Implementing a good indexing algorithm for this is not easy. In particular if you are indexing large amounts of text (documents) instead of short strings.

I settled on MiniSearch. [0] It is fast & small enough and fairly feature complete.

Afterwards I made a few contributions to improve performance and implement a better scoring algorithm. So I'm probably a bit biased now. Take my recommendation with a grain of salt.

Personally I think that OP's library does not perform searches, fuzzy or otherwise. It's much more similar to 'grep'. Try searching for "mario adventures". It won't actually find the most obvious results, because the order of the keywords in the search string must match the order of the keywords in the indexed text.

[0]: https://github.com/lucaong/minisearch

[1]: https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uF...


> It won't actually find the most obvious results, because the order of the keywords in the search string must match the order of the keywords in the indexed text.

i'm really confused why people dont bother actually reading anything i spent so much time distilling (both in the readme and the short instructions in this submission itself), which explicitly spell out how to adjust the necessary options precisely for what you're asking.

simply toggling outOfOrder returns perfectly good results:

https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uF...


A blunt answer to your question is: I didn't take the time to read everything.

Sorry for missing that feature. I stand by my overall point though. There is more to search than finding matches. Handling real world typing errors is one (try searching for "mario avdentures" or "mario adventutes").

Ordering results by relevance is not trivial either. And it does not appear this library intends to tackle that problem completely. Relevance scoring should take into account the term frequency and the document/field length, as well as average length.

I don't mean to discount your project though. There are many distinct use cases related to searching and filtering, and power to you for solving it in a way that works for you (and I'm sure for others as well). I just wanted to share my experience in exploring the space of full text search libraries that handle long form text and typos gracefully.


> A blunt answer to your question is: I didn't take the time to read everything.

s/everything/anything

this is by no means meant to replace fulltext search, with term omission tolerance, typos, stemming, etc.

fwiw, i wrote this to replace a frontend search strategy that was originally based on MiniSearch and [apparently] gave underwhelming results and/or performance.

since you seem to be familiar with MiniSearch, perhaps you can help improve the result ordering for this:

https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uF...

this is my fundamental issue with fulltext searches. even if the result order could be made sane, how do you find the "junk cutoff" (there clearly is one). at least it does seem to place the best results on top, though ordered haphazardly, which is also true of FlexSearch.

boiling things down to a single relevance score, and making the user tweak various boosting knobs to nudge things mostly [but not perfectly] into place seems like a dogma that most of these engines suffer from.


Many search engines default to returning hits when at least 1 keyword match. Not returning any results just because not every single keyword could be matched leads to a poor user experience. However, good relevance scoring is key here. Hits that match most keywords should be somewhere at the top. The reasoning is that a user will try to find the result they are looking for in the top hits.

In the case of MiniSearch you can change this default and configure it to only return hits if all keywords match [0].

Searching for "super ma" seems like an autocomplete query, for which most users have subtly different expectations than regular search. MiniSearch has a separate method for that, essentially baking in different default settings. [1]

[0] https://lucaong.github.io/minisearch/classes/_minisearch_.mi...

[1] https://lucaong.github.io/minisearch/classes/_minisearch_.mi...

Edit:

> s/everything/anything

No need for the sneer.


ok, thanks for the advice. i'll see if the MiniSearch settings can be improved for this demo without drastically affecting the quality of the "search" case.


Interesting that uFuzzy depends on regexps. Python's regex library can handle fuzzy matching (up to a given number of insert/delete/substitutions) - if JavaScript had that, it would be a relatively simple change.


i'm curious how a regexp can handle arbitrary substitutions. afaik you need to be explicit about what alterations can occur for each character, but if that list is arbitray, then the regexp will simply match everything.

e.g.

/^cat$/ will match cat

/^ca[tr]$/ will match cat and car

/^ca\w$/ will match any 3 letter word with 'ca' prefix. (any substitution for third letter)

/^\w\w\w$/ will allow alterations for any letter, making it useless.

you can make a more complex regex that allows a single substitution in any char position

/^(?:ca\w|c\wt|\wat)$/

but this gets out of hand very quickly, especially with possible insertions, deletions, transpositions, etc.


Sure, but then that's pretty much the core thing to solve on the fuzzy search problem.

Anyway, not being a hater (as that other commenter suggested), if you could add support for this (even if it's limited, one or two typos at most) you will blow all other libs out of the way since what you have now is already quite good :D


> even if it's limited, one or two typos at most

This would help a lot I think.

When I'm on mobile, the most common error is "keyboard offset error", where I hit a "key" next to the one I intended. So it's not completely arbitrary, and it only happens once or twice in 99% of the cases.

On a physical keyboard this also happens, but the more likely error is synchronization error, where I hit a left-hand key before a right-hand key or vice versa, the classic teh vs the. Again usually only a single such error per word and not arbitrary.

Finally there's also the common case of simply missing a letter. Again, limited and not arbitrary.

So at least for my sake, anything that can handle the above errors would go a long way.


maybe, something limited in scope like this could work...would be interesting to explore.


It might be a rare opinion in JS ecosystem, but I like the scope of this project. It solves the crux of the problem; I can compose the rest (like testing other strings alongside the input to account for typos)


you're not gonna like the perf of doing this for all variations. it will be far faster to put together a regex with all variations and do a single pass.

of course if you have specific known/common mistakes, it could be useful to only consider those. for example, spelling errors are less common at the start of words. and keyboard letter proximity limits which mistakes are likely.


> for example, spelling errors are less common at the start of words

At least on mobile, I find I make just as many first letter mistakes by hitting the wrong "key" on the on-screen keyboard, as I do in any other position of the word. Very annoyingly the predictive text engine assumes like you mention and it takes a lot for it to consider the first letter being wrong.


https://pypi.org/project/regex/ search for 'fuzzy'

The regex library builds the NFA, so it probably bakes the fuzzy stuff into the NFA itself rather than changing the pattern.

Another option is something like word2vec where you cluster words of similar meaning together, as a bonus this usually handles typos as well. Not really in scope for your library, but I find it cool!


this has now been addressed :)

https://news.ycombinator.com/item?id=33053180


From fuzzy search I expected that entering "super meet boy" or "super maet boy" will return "Super Meat Boy" but unfortunately currently it doesn't work this way and it's quite disappointing.

https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uF...


this has now been addressed by 3 additional options that handle all variations of single substitution, single transposition or single deletion. (they're not exposed in the demo form controls yet, but can be enabled via url params):

https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uF...

commit:

https://github.com/leeoniya/uFuzzy/commit/63dc67b8bdb7577f85...


this is answered in other parts of the thread here [1], but there might be some hope for it still!

https://github.com/leeoniya/uFuzzy/issues/2

however, it's probably infeasible to accomodate more than single-char-per-term substitution tolerance. thankfully, both your examples have 1-char substitutions :)

[1] https://news.ycombinator.com/item?id=33042665


It is explained, but than why is it fuzzy? Isn't the fuzziness part exactly what's missing?


I think you have a different idea of what fuzzy means than the creator. I use a fuzzy search in my config that wouldn’t match the parent example. As soon as you include a word that isn’t in the match, the match is removed. I am more concerned that the fuzziness would prioritize a match of “super meet boy and more” over “super and the gang explore meeting friends o boy” because it’s closer to the search string (by what I would define as fuzzy).


Fuzzy searching generally doesn't mean exactly the same thing as typo correcting or other full text search features.

It's more like you can search using word parts. Similar to how your IDE's search work when jumping to files. E.g. you can type 'smb' to find 'Super Meat Boy'. Or type something like 'sup mea acc' to find 'Super Meat Boy Accessed Content'

So it requires you to know exactly what you're looking for, but you can find it quickly without having to type a lot.

Generally for full text search like you're describing you need to do that on the server side. It would be too heavy to have something full featured like that on the client side.


there are many ways to be fuzzy, so it's quite subjective.

uFuzzy can be made tolerant to extra insertions in the matches between/around the specified needle chars, and can also handle out of order terms. those together cover a surprising amount of common cases.

but it's not fuzzy in unlimited ways, such as letter omissions in the match (a superset of substitutions) like a spellcheck or levenshtein distance would be...but extreme tolerance often produces garbage results, too.

actually, might be able to handle single-char-per-term omissions as well:

https://github.com/leeoniya/uFuzzy/issues/2#issuecomment-126...


Nice. Here’s the fuzzymatcher I wrote years ago. My main implementation was C++ but there’s a JS version and web demo.

https://www.forrestthewoods.com/blog/reverse_engineering_sub...


heh, think i have yours in my comparison demos (on phone currently, cannot verify)

you can make uFuzzy behave similarly by setting intraMax to Infinity (just remove 0 from the field). but the results are usually too fuzzy in this config, though it depends on the corpus and application (auto-complete vs search)


Skimming I don’t think you have my code. Also on phone so may have missed it.

Amusingly you do have Hearthstone cards and UE4 filenames whose data definitely comes from my post.

My code has been used in more than a few fuzzy matchers. Anytime I see one posted I always check to see if it’s related or not. =D

Mine was definitely tuned for filenames and matching CamelCaseWords.


yeah, i took the corpus from fuzzysort and extended it some :)

> Mine was definitely tuned for filenames and matching CamelCaseWords.

mhm, mine considers case-change and alpha-num boundaries same as whitespace and punct boundaries (boostable), so will also float those results up when appropriate.


Ha. Yeah those datasets are definitely used by a lot of fuzzy libs. I generated the original and I’m like a proud great grandpa at this point. :)

I am annoyed Blizzard never implemented fuzzy matching for their cards. At least not the last time I played Hearthstone…


It doesn't seem to handle missing spaces

https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uF...


set intraMax to something big, or Infinity (delete the 0). this will allow arbitrary amount of junk between each char


I see some results when intraMax is set to Inf but it doesn't seem to pick up "Llanowar Elves" despite the other libs hitting it first or second. If I narrow the list to only mtg_16000, I get 0 results.


ah yes you should expand the allowable arbitray chars within terms to include a space

intraChars: [a-z\d ]

(probably not ideal for perf)

you can leave intraMax at 1 in this case

you dont need either of these if you have camelcase LlanowarElves but will for lowercase elves: Llanowarelves


Nice. I didn't know it but I was about to be looking for something like this.

How difficult - or not - would it be to use it with https://bootstrap-table.com?


not familiar with it, but depends on what you need.

filtering one one column? easy.

filtering multiple columns via AND? also easy.

filtering multiple columns plus highlighting matched parts in each? will take more work, but shouldnt be daunting.


Ok. Thx. I'll give your's a go. If I stumble I'll open an issue.

Truth be told I'm working on a side project POC / MVP, found bootstrap table, and just dove in. I know 3 to 6 hrs more than you :)


Looks interesting - I would like to try this library out to compare it with Fuse that I wrote about here https://www.lloydatkinson.net/posts/2022/writing-a-fuzzy-sea...

Also as this is a new library I highly recommend either fully dropping commonjs support or at least creating a ES module version and showing how to use that instead.


to compare it with Fuse, click the provided compare link in this post :)

> at least creating a ES module version

that exists


Document it then!


The example ”spac ca” doesn’t take advantage of fuzziness (no typos/swaps), a simple inverted index of term prefixes/n-grams would produce a good enough search there.


can you clarify what you mean by "doesn’t take advantage of fuzziness"?

what results are you expecting but not getting back? or what results are you getting back that you are not expecting?


I tend to think of fuzzy search as having specific leniency: for example, ”xemaple” would find ”example” and ”exmaple”


in reality what happens with this level of tolerance is you get back "example" followed by other nonsense results, which is exactly what uFuzzy tries to avoid. there are plenty of existing fuzzy searches that behave too loosely.

xemaple has two transposition errors, one in the critical (and unusual) prefix position. if we allowed this level of fuzz in "spac ca" it should presumably match "psac ca", "spca ac", "sapc ac". it seems like a good idea at first, but with a big enough corpus you start to realize that these mangled variations actually appear as substrings in very unrealated matches, (they're all two swap errors away)

i dont think you can have what you're asking for without destroying match quality and making things a lot slower in the process - the very qualities that distinguish uFuzzy from what already exists.

i did add support for single transpositions, substitutions, and deletions (in non prefix or suffix positions) yesterday. so your version with one swap ("exmaple") should work.

https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uF...


Bzzzz, unrelated comment:

Some weeks ago (86 days ago) you commented on a post with this: <<

”Dammit, Jack, we got another one.” ”Another awakening?”

”Yeah.”

”Just do what we always do, make their argument look nihilist or absurdist. Remember to avoid mentioning capitalism and neoconservatism!”

>>

and then said that you plan on becoming a professional author. I remembered of your existence and it crossed my mind that I'm curious about how that is going for you.


That was a joke. I have become a professional joke author.


Also possibly of interest in this regard: Pagefind [0].

[0]: https://github.com/cloudcannon/pagefind


I'm impressed. 3.54KB minified, great performance, good results.


> best-in-class memory overhead

Was using indexeddb considered? I've yet to see an easy to use library that allows you to store simple but large json in indexeddb, and then query against that. Useful for something as simple as an emoji picker which needs to store keywords or aliases.


not sure this would be less overhead than having all strings in memory. you gotta get them into memory anyhow, right?

i mean, you can store it in localstorage or wherever to save net transfer for repeated use, but i dont think it would be faster to use indexdb directly at runtime.


I think indexeddb is somewhat orthogonal to this library.

The memory efficiency you might get would be that you don't need to hold the whole dataset in memory while running the filter step though at the moment it looks like it assumes you're working with an array in memory (https://github.com/leeoniya/uFuzzy/blob/main/src/uFuzzy.js#L...). That said I suspect there distance between this and something that could search against a stream of data is pretty short.


im happy with fuse js - in my case i need to match against messy data sets (typos, underscores for spaces, etc) and it works quite well; in this dataset it's hard to compare how ufuzzy would fare. what algorithm does it implement?


if you need solid typo tolerance uFuzzy won't work well since it's based on regexps.


"doesn't suck" - why is this in the title? What does one hope to achieve by this phrase?


This reminds me of my dream of a decent desktop search product


Nice! Where I work we need to look at how we want to implement search and I’ve been putting it off. Will put this in our list to evaluate!


Out of curiosity, how does this compare with fuse.js?



Christ that demo is laggy. Can you move the search to a web worker so the UI doesn't lag whilst typing?


sorry that Fuse is slow, but that's the point of the demo, isn't it?


If it's render blocking, that's the whole point right? A lot of people won't use these libraries on a separate thread.


its hard to compare directly because by default they return different results; so any comparison should control for that. that said the initial numbers for ufuzzy are tiny, so im quite curious


it's very hard to compare. definitely eval each lib's results before accepting any bench numbers as valid!

and also please submit any improvements to the codebase to get the results better/closer. i'm only an expert in one lib (mine).


Amazing job on this!

I see that your heap is incredibly small, how do you keep such a small working set of memory while keeping the library so fast?


ask the V8/Spidermonkey/JavaScriptCore devs how they have such fast regexp interfaces :)

other than that, nothing special. the stats-gathering pass works with large columnar arrays rather than thousands of objects. and it only does this when the pre-filtered resultset is small enough (default threshold is 1000). there's no string distance checks done, so there's no need to heap-allocate thousands of M*N matricies.


have you considered any search methods that are too big for memory or at least want to avoid making the user wait to download, but doesn't require a dynamic server backend?


no, and definitely outside the scope of a micro-lib!


JSON search should be a web standard not reliant on JavaScript. Why not just have a way to include a link element that can be used as part of the search.


you could probably say the same about declarative ui frameworks, but given how many opinions there are for what the "right" way is in both cases, i doubt this will ever happen.


From the README: > This is my fuzzy . There are many like it, but this one is mine.

Another new better package in the JavaScript world just for the sake of pride.


congratulations on reading nothing else. best time saving technique!

in case you're still confused:

https://web.archive.org/web/20070310183121/http://www.lejeun...


I didnt mean to be disrespectful. I just suffered so much the pain of what I highlighted of the introduction that I reacted too fast. I apologize to the author of the package.

PS: leeoniya, your link makes no sense in this context.


no worries.

my link explains where that phrase comes from. it's a play on words that has nothing to do with pride or why i made "another one" (which is explained in great detail in many other places).

i made uFuzzy because i couldnt get the other 20 existing libraries to return only the results that i expected.


How does it compare to levenstein?


levenstein by itself isnt really anything so this question is unanswerable. string distance is only a part of a search algo. generally i expect that it will be significantly slower, but more tolerant to typos. the rest heavily depends on many other implementation details.


nice!

for a language that evolved to manipulate text documents it is odd that it has no features of this kind. StartsWith endsWith and indexOf seems an amazingly unsophisticated set of tools.

autocomplete ui is also terrible compared to phones?

why?


>a language that evolved to manipulate text documents

Tell me you know nothing about JavaScript's history, without telling me you know nothing about JavaScript's history.


Strange bug in my query. I was trying to say js, for a good while, did little more than manipulate html. After some uhhh inbreeding(?) it evolved to do wild crazy things like serverside stuff (read: insane) where it puts the entire document together. (hallelujah!) But before that it was just text?

We got arrow functions use strict template strings and even asm.js, countless truly insane things were added to the eco system like unicode symbols and css animations (madness!) all of a quality as~if a late Sunday night project - per www tradition.

Im simply suggesting the next weird thing should be fuzzy text matching. Even if the implementation is complete garbage, like a permanently frozen turd, say, string compare returning a value between 0 and 1 I could see myself use it often enough. If needed one can always wrap some enormous enterprise natural language processing api to keylog the user and beam their delicious data back to the mothership for advertisement and creditscores etc

I know we cant have nice things but I can dream and it doesn't have to be nice?


when Larry Ellison decided to generously author a free script version of Java...




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: