I myself have been working on a personal search engine for some time, and one pro...

iudqnolq · 2024-10-10T14:45:04.000000Z

I'm super new to this so I'm probably missing something simple, but isn't a trigram index one of the canonical solutions for fuzzy search? E.g. https://www.postgresql.org/docs/current/pgtrgm.html

That often involves recording each trigram's original position, but I think that's necessary to rank "I like happy cats" higher than "I like happy dogs but I don't like cats" in a search for "happy cats".
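
To make the position point concrete, here is a minimal sketch of a positional trigram index. It is not how pg_trgm works internally; the function names (`trigrams`, `build_index`, `search`) and the scoring rule (count the query trigrams that agree on a single alignment within the document) are assumptions made up for illustration.

```python
from collections import defaultdict

def trigrams(text):
    """Return (position, trigram) pairs for the lowercased text."""
    s = text.lower()
    return [(i, s[i:i + 3]) for i in range(len(s) - 2)]

def build_index(docs):
    """Map each trigram to a list of (doc_id, position) postings."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for pos, gram in trigrams(text):
            index[gram].append((doc_id, pos))
    return index

def search(index, query):
    """Score each doc by the most query trigrams sharing one alignment offset."""
    # (doc_id, offset) -> set of query positions matched at that offset
    votes = defaultdict(set)
    for qpos, gram in trigrams(query):
        for doc_id, dpos in index.get(gram, ()):
            votes[(doc_id, dpos - qpos)].add(qpos)
    best = defaultdict(int)
    for (doc_id, _offset), matched in votes.items():
        best[doc_id] = max(best[doc_id], len(matched))
    # Highest score first
    return sorted(best.items(), key=lambda kv: -kv[1])

docs = ["I like happy cats", "I like happy dogs but I don't like cats"]
results = search(build_index(docs), "happy cats")
print(results)  # [(0, 8), (1, 4)]
```

Ignoring positions, the second sentence shares 7 of the query's 8 trigrams, so the two would score almost the same. Requiring the trigrams to line up is what separates them.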

warangal · 2024-10-10T15:33:13.000000Z

Yes, mainly trigrams, but bigrams or a combination of both are also commonly used to implement fuzzy search; zoekt uses a trigram index too. But such indices depend heavily on the content being indexed. For example, if a query contains a rare trigram that was never indexed, they fail to return relevant results! LSH implementations, on the other hand, use a more diverse collection of statistics, depending on the number of buckets and the n-gram/window size used, so they compare better against unseen content/bytes at query time. It is not cheap though: each hash is around 30 bytes, which most of the time is even more than the string/text being indexed! But it gives fixed-size hashes independent of the size of the indexed content, and it acts as an "auxiliary" index that can be queried independently of the original index. Hash comparison can be optimized, which leads to quite fast fuzzy search.
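
As a rough illustration of the fixed-size hash idea, here is a small SimHash-style sketch over character n-grams. The comment above doesn't say which LSH scheme is meant, so SimHash is swapped in purely as an example; the 64-bit size, the n-gram width, and the names `ngrams`, `simhash` and `hamming` are all assumptions.

```python
import hashlib

def ngrams(text, n=3):
    """Character n-grams of the lowercased text (at least one, even if short)."""
    s = text.lower()
    return [s[i:i + n] for i in range(max(len(s) - n + 1, 1))]

def simhash(text, n=3, bits=64):
    """Fold all n-gram hashes into one fixed-size fingerprint of `bits` bits."""
    counts = [0] * bits
    for gram in ngrams(text, n):
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        for b in range(bits):
            # Each n-gram votes +1 or -1 on every bit position
            counts[b] += 1 if (h >> b) & 1 else -1
    # A bit is set in the fingerprint when its votes are net positive
    return sum(1 << b for b in range(bits) if counts[b] > 0)

def hamming(a, b):
    """Number of differing bits; smaller suggests more similar inputs."""
    return bin(a ^ b).count("1")

short = simhash("happy cats")
long_ = simhash("I like happy dogs but I don't like cats " * 50)
print(hamming(short, long_))
```

Both fingerprints are 64 bits no matter how long the input was, and comparing them is a single XOR plus a popcount, which is where the speed comes from. Similar inputs tend to end up with a small Hamming distance, but that is probabilistic rather than guaranteed.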