
The math and algorithms behind a search library - kbr
https://blog.kabir.ml/posts/inside-wade
======
JamesMcMinn
This is an interesting way of doing text search over a collection of
documents, which I suppose makes sense if you only have a small number of
documents to search.

Generally, an inverted index [1] is used rather than a trie, and collection-
based term weights are calculated using something like IDF [2]. You could then
use something like cosine similarity to compute length-normalised rankings for
documents.
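That pipeline can be sketched in a few lines of JavaScript (a toy illustration with made-up documents and a naive whitespace tokenizer, not Wade's actual code; the length normalisation here is a crude stand-in for a true cosine measure):

```javascript
// Toy inverted index with TF-IDF weights and length-normalised ranking.
const docs = ["the quick brown fox", "the lazy dog", "quick brown dogs"];
const tokenize = s => s.toLowerCase().split(/\s+/);

// Build inverted index: term -> Map(docId -> term frequency)
const index = new Map();
docs.forEach((doc, id) => {
  for (const term of tokenize(doc)) {
    if (!index.has(term)) index.set(term, new Map());
    const postings = index.get(term);
    postings.set(id, (postings.get(id) || 0) + 1);
  }
});

// IDF: log(N / document frequency)
const idf = term => Math.log(docs.length / (index.get(term)?.size || 1));

// Score documents for a query: sum of tf * idf per matching term,
// then divide by document length as a crude normalisation.
function search(query) {
  const scores = new Map();
  for (const term of tokenize(query)) {
    for (const [id, tf] of index.get(term) || []) {
      scores.set(id, (scores.get(id) || 0) + tf * idf(term));
    }
  }
  return [...scores]
    .map(([id, s]) => [id, s / tokenize(docs[id]).length])
    .sort((a, b) => b[1] - a[1]);
}
```

With these documents, `search("quick brown")` ranks the shorter matching document first, since both contain the same terms but normalisation favours the shorter one.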

There are of course many different approaches to text search, but this is the
first time I've personally come across this one.

If anyone is interested in an in-depth look at how text search works,
Introduction to Information Retrieval [3] is available for free and is a
fantastic book.

[1]
[https://en.wikipedia.org/wiki/Inverted_index](https://en.wikipedia.org/wiki/Inverted_index)

[2]
[https://en.wikipedia.org/wiki/Tf%E2%80%93idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)

[3] [https://nlp.stanford.edu/IR-book/](https://nlp.stanford.edu/IR-book/)

~~~
kbr
This is how Wade was originally going to be implemented. I actually had a
local version of it using an inverted index and a deterministic acyclic finite
state automaton [1]. It used TF-IDF for the rankings as well.

I began to experiment with a couple of my own ideas and used a trie rather
than a DAFSA because it could be stored and loaded rather than generated at
runtime. I also created and used the various linear methods described in the
post to rank the documents.

[1]
[https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_s...](https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton)

~~~
mooneater
Consider BM25 as a newer substitute for tf-idf
[https://en.wikipedia.org/wiki/Okapi_BM25](https://en.wikipedia.org/wiki/Okapi_BM25)
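For the curious, the per-term BM25 score looks like this (an illustrative sketch with the usual k1 and b parameters and a Lucene-style smoothed IDF; none of this is tied to Wade's code):

```javascript
// BM25 contribution of a single term to a single document's score.
// tf: term frequency in the doc, df: number of docs containing the term,
// N: total number of docs, dl: doc length, avgdl: average doc length.
function bm25Term(tf, df, N, dl, avgdl, k1 = 1.2, b = 0.75) {
  // Smoothed IDF (the +1 keeps it positive even for very common terms).
  const idf = Math.log((N - df + 0.5) / (df + 0.5) + 1);
  // Term-frequency saturation, scaled by document length relative to average.
  return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * dl / avgdl));
}
```

The key improvement over raw TF-IDF is saturation: each extra occurrence of a term adds less to the score than the previous one, and b controls how strongly long documents are penalised.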

------
amelius
Finally a post on search algorithms on HN again!

Pardon my surprise, but I think it's kind of strange that we see so little on
HN of this topic, which is one of the pillars of Computer Science, and also at
the core of one of the most successful IT companies.

~~~
leetcrew
I think we see surprisingly little of anything search related. In terms of
full-scale solutions, there are basically just Elasticsearch and Solr. When
you look at lower-level libraries, there is very little outside the Lucene
world.

~~~
sorokod
I believe that the search part of both ES and Solr is Apache Lucene

~~~
amelius
Yes, it's good software, but pretty much a monoculture.

~~~
erpellan
Yes, and?

------
wyc
If you're a working programmer looking for a bit more depth on text searching,
check out Taming Text by Ingersoll. It was written by some real pros, and it's
very practical. It goes over everything from the first naive approaches to
what drives many engines like Lucene today.

[https://www.amazon.com/Taming-Text-Find-Organize-Manipulate/...](https://www.amazon.com/Taming-Text-Find-Organize-Manipulate/dp/193398838X)

------
stochastic_monk
I'm a little surprised to not be seeing some classical search algorithms.

For online pattern searching, I'd want Boyer-Moore with the Galil rule for
guaranteed linear search. [Due to an unfortunate accident in C++ history,
Boost and C++17 skip the Galil rule, leading to O(nm) worst-case runtime
instead of O(n+m).]
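Full Boyer-Moore with the good-suffix and Galil rules is fairly involved, but the bad-character idea on its own (the Horspool simplification) fits in a few lines of JavaScript (an illustrative sketch, not any particular library's implementation, and without the Galil rule its worst case is also O(nm)):

```javascript
// Boyer-Moore-Horspool: bad-character rule only. Typically sublinear in
// practice because mismatches let us skip ahead by up to m characters.
function horspool(text, pattern) {
  const m = pattern.length, n = text.length;
  if (m === 0) return 0;
  // Shift table: distance from each pattern char (except the last)
  // to the end of the pattern.
  const shift = new Map();
  for (let i = 0; i < m - 1; i++) shift.set(pattern[i], m - 1 - i);
  let pos = 0;
  while (pos <= n - m) {
    let j = m - 1;
    while (j >= 0 && text[pos + j] === pattern[j]) j--; // compare right-to-left
    if (j < 0) return pos; // full match
    // Skip based on the text character aligned with the pattern's end.
    pos += shift.get(text[pos + m - 1]) ?? m;
  }
  return -1;
}
```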

For offline search, Burrows-Wheeler indexing gives query time linear in the
query string and _independent_ of the length of the searched document.

~~~
kbr
This isn't an exact pattern searching library. Wade originally used Boyer-
Moore, but has since switched to a trie [1] structure to search for _terms_
rather than the whole query.

This allows for result documents that can contain some of the terms, all of
the terms, or a prefix of a term. It's also pretty fast as it doesn't need to
run a linear search through the documents. I'm not sure what you mean by
"classical" search algorithms, but a trie based search is a pretty standard
way of doing it.

[1] [https://en.wikipedia.org/wiki/Trie](https://en.wikipedia.org/wiki/Trie)
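A trie along these lines can be sketched as follows (an illustrative toy, not Wade's actual structure; `makeNode`, `insert`, and `prefixSearch` are names invented here):

```javascript
// Minimal trie: insert terms character by character, then walk a prefix
// and collect every complete term at or below the node it reaches.
function makeNode() {
  return { children: Object.create(null), terminal: false };
}

function insert(root, term) {
  let node = root;
  for (const ch of term) {
    node = node.children[ch] ?? (node.children[ch] = makeNode());
  }
  node.terminal = true; // marks the end of a complete term
}

function prefixSearch(root, prefix) {
  let node = root;
  for (const ch of prefix) {
    node = node.children[ch];
    if (!node) return []; // no term starts with this prefix
  }
  const results = [];
  (function walk(n, path) {
    if (n.terminal) results.push(prefix + path);
    for (const ch in n.children) walk(n.children[ch], path + ch);
  })(node, "");
  return results;
}
```

Because the nodes are plain objects, the whole structure can also be serialised with JSON.stringify and loaded back directly, which fits the stored-and-loaded-rather-than-generated point made upthread.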

~~~
stochastic_monk
I understand, thank you. (Though a lot of inexact pattern matching is done
using exact pattern matching of substrings with the pigeonhole principle.)

Tries led to suffix trees led to suffix arrays led to Burrows-Wheeler
transform led to FM index, which is probably the best way to do an offline
search.

If you haven't looked at BWT indexing, it's worth at least checking out. It
preserves the advantages you describe while keeping memory cost linear in the
size of the search space. It's a lot hairier to implement, unfortunately.
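For anyone curious what the transform itself looks like, here is a naive construction in JavaScript (the rotation sort is O(n² log n) and purely for illustration; real FM-index builds derive the BWT from a suffix array):

```javascript
// Naive Burrows-Wheeler transform: sort all rotations of text + sentinel,
// then read off the last column.
function bwt(text) {
  const t = text + "\u0000"; // unique sentinel that sorts before everything
  const rotations = [];
  for (let i = 0; i < t.length; i++) {
    rotations.push(t.slice(i) + t.slice(0, i));
  }
  rotations.sort();
  return rotations.map(r => r[r.length - 1]).join("");
}
```

The output groups equal characters into runs ("banana" becomes "annb\0aa"), and it's that last column, plus per-character rank counts, that lets the FM index run backward search in time proportional to the pattern length rather than the text length.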

~~~
stochastic_monk
A nice summary can be found here:

[http://www.cs.jhu.edu/~langmea/resources/lecture_notes/bwt_a...](http://www.cs.jhu.edu/~langmea/resources/lecture_notes/bwt_and_fm_index.pdf).

------
bootcat
Certainly its worth taking a look at another javascript library which works
for fuzzy search called, fuse, [http://fusejs.io/](http://fusejs.io/).

------
jihadjihad
This is neat, and reminds me a bit of the tricks used by RediSearch. I came
across it on HN recently and have been geeking out on it since!
[http://redisearch.io](http://redisearch.io)

------
johnhenry
Lunr, named to correspond to Apache Solr, is also worth a look.
[https://lunrjs.com/](https://lunrjs.com/)

------
hnrc
[https://news.ycombinator.com/item?id=14805395](https://news.ycombinator.com/item?id=14805395)

~~~
kbr
At the time of that post, the article covered a simpler example and only
explained the trie structure; Wade didn't have the ranking system either. It
has since been rewritten to include the methods used to rank the relevance of
a term to a document, and to show how they are applied in a more complex
example.

