Hacker News
The math and algorithms behind a search library (kabir.ml)
202 points by kbr on Nov 11, 2017 | hide | past | favorite | 24 comments

This is an interesting way of doing text search over a collection of documents, which I suppose makes sense if you only have a small number of documents to search.

Generally, an inverted index [1] is used rather than a trie, and collection-based term weights are calculated using something like IDF [2]. You could then use something like cosine similarity to compute length-normalised rankings for documents.

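For anyone unfamiliar, that pipeline can be sketched in a few lines of JavaScript (function names and the tokenisation are made up for illustration, and cosine length normalisation is omitted to keep it short):

```javascript
// Build an inverted index: term -> Map(docId -> term frequency).
function buildIndex(docs) {
  const index = new Map();
  docs.forEach((text, docId) => {
    for (const term of text.toLowerCase().split(/\W+/).filter(Boolean)) {
      if (!index.has(term)) index.set(term, new Map());
      const postings = index.get(term);
      postings.set(docId, (postings.get(docId) || 0) + 1);
    }
  });
  return index;
}

// Score documents by summed tf-idf per query term, highest first.
function search(index, numDocs, query) {
  const scores = new Map();
  for (const term of query.toLowerCase().split(/\W+/).filter(Boolean)) {
    const postings = index.get(term);
    if (!postings) continue;
    const idf = Math.log(numDocs / postings.size); // rarer terms weigh more
    for (const [docId, tf] of postings) {
      scores.set(docId, (scores.get(docId) || 0) + tf * idf);
    }
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]);
}
```

The key point is that a query only touches the postings lists for its own terms, rather than scanning every document.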
There are of course many different approaches to text search, but this is the first time I've personally come across this one.

If anyone is interested in an in-depth look at how text search works, Introduction to Information Retrieval [3] is available for free and is a fantastic book.

[1] https://en.wikipedia.org/wiki/Inverted_index

[2] https://en.wikipedia.org/wiki/Tf%E2%80%93idf

[3] https://nlp.stanford.edu/IR-book/

This is how Wade was originally going to be implemented. I actually had a local version of it using an inverted index and a deterministic acyclic finite state automaton [1]. It was using TF-IDF as well for the rankings.

I began to experiment with a couple of my own ideas and used a trie rather than a DAFSA because it could be stored and loaded rather than generated at runtime. I also created and used the various linear methods described in the post to rank the documents.

[1] https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_s...

Consider BM25 as a newer substitute for tf-idf https://en.wikipedia.org/wiki/Okapi_BM25
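The per-term BM25 score from that page is short enough to write out directly (a sketch; k1 and b are the usual free parameters, avgdl is the average document length in the collection):

```javascript
// BM25 contribution of one term to one document's score.
// tf: term frequency in the doc, df: number of docs containing the term,
// numDocs: collection size, docLen: this doc's length.
function bm25Term(tf, df, numDocs, docLen, avgdl, k1 = 1.2, b = 0.75) {
  // IDF with the +1 smoothing used by Lucene to keep it non-negative.
  const idf = Math.log((numDocs - df + 0.5) / (df + 0.5) + 1);
  // Term-frequency component saturates as tf grows (bounded by k1 + 1).
  return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * docLen / avgdl));
}
```

Unlike raw tf-idf, the score saturates with repeated occurrences of a term, and the b parameter controls how strongly long documents are penalised.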

Finally a post on search algorithms on HN again!

Pardon my surprise, but I think it's kind of strange that we see so little on HN of this topic, which is one of the pillars of Computer Science, and also at the core of one of the most successful IT companies.

I think we see surprisingly little of anything search related. In terms of full-scale solutions, there is basically just Elasticsearch and Solr. When you look at lower-level libraries, there is very little outside the Lucene world.

You may be interested by Xapian (https://xapian.org/) which is a very solid alternative to Lucene.

There's also Sphinx search (written in C++ and not based on Lucene) and the new Manticore Search fork. (Disclaimer: I was/am involved in both projects.)

I believe that the search part of both ES and Solr is Apache Lucene.

Yes, it's good software, but pretty much a monoculture.

Yes, and?

Yes, and I'd also like to see more posts/projects on the combination of search and NLP (natural language processing).

Doesn't SQLite of all things include an (optional) native full-text-search engine?

And now Vespa.

If you're a working programmer looking for a bit more depth on text searching, check out Taming Text by Ingersoll. It was written by some real pros, and it's very practical. It goes over everything from the first naive approaches to what drives many engines like Lucene today.


I'm a little surprised to not be seeing some classical search algorithms.

For online pattern searching, I'd want Boyer-Moore with the Galil rule for guaranteed linear search. [Due to an unfortunate accident in C++ history, Boost and C++17 skip the Galil rule, leading to O(nm) worst-case runtime vs. O(n+m).]
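For a flavour of the family, here is the simpler Boyer-Moore-Horspool variant, which keeps only the bad-character rule (no good-suffix or Galil rule, hence O(nm) worst case, though fast on typical text):

```javascript
// Boyer-Moore-Horspool: return the index of the first occurrence of
// pattern in text, or -1 if absent.
function horspool(text, pattern) {
  const n = text.length, m = pattern.length;
  if (m === 0) return 0;
  // Bad-character table: shift distance for each char in the pattern
  // (distance from its last occurrence to the end of the pattern).
  const shift = new Map();
  for (let i = 0; i < m - 1; i++) shift.set(pattern[i], m - 1 - i);
  let pos = 0;
  while (pos <= n - m) {
    let j = m - 1;
    // Compare right to left.
    while (j >= 0 && text[pos + j] === pattern[j]) j--;
    if (j < 0) return pos;
    // Skip ahead based on the text character under the pattern's last slot.
    pos += shift.get(text[pos + m - 1]) || m;
  }
  return -1;
}
```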

For offline search, Burrows-Wheeler indexing has performance linear in the query string and independent of the length of the searched document.

This isn't an exact pattern searching library. Wade originally used Boyer-Moore, but has since switched to a trie [1] structure to search for _terms_ rather than the whole query.

This allows for result documents that can contain some of the terms, all of the terms, or a prefix of a term. It's also pretty fast as it doesn't need to run a linear search through the documents. I'm not sure what you mean by "classical" search algorithms, but a trie based search is a pretty standard way of doing it.

[1] https://en.wikipedia.org/wiki/Trie
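A minimal sketch of the idea (names are hypothetical, not Wade's actual API): insert terms into a trie, then return every stored term that starts with a given prefix.

```javascript
// Insert a word into a trie represented as nested plain objects.
function trieInsert(root, word) {
  let node = root;
  for (const ch of word) {
    if (!node[ch]) node[ch] = {};
    node = node[ch];
  }
  node.end = true; // marks a complete term
}

// Return all stored terms beginning with the given prefix.
function triePrefixSearch(root, prefix) {
  let node = root;
  for (const ch of prefix) {
    node = node[ch];
    if (!node) return []; // no term has this prefix
  }
  const results = [];
  (function collect(n, word) {
    if (n.end) results.push(word);
    for (const ch of Object.keys(n)) {
      if (ch !== 'end') collect(n[ch], word + ch);
    }
  })(node, prefix);
  return results;
}
```

Lookup cost depends only on the length of the prefix, not on how many documents or terms are indexed.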

I understand, thank you. (Though a lot of inexact pattern matching is done using exact pattern matching of substrings with the pigeonhole principle.)

Tries led to suffix trees led to suffix arrays led to Burrows-Wheeler transform led to FM index, which is probably the best way to do an offline search.

If you haven't looked at BWT indexing, it's worth at least checking out. It preserves the advantages you speak of with linear memory cost with the search space. It's a lot more hairy to implement, unfortunately.
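To illustrate one step in that chain, here is a naively built suffix array with binary search (a sketch, not an FM index; real implementations construct the array in linear time and compress it via the BWT):

```javascript
// Build a suffix array the naive way: sort all suffix start positions
// by the suffixes they denote. O(n^2 log n); fine for a demo only.
function buildSuffixArray(text) {
  return [...text].map((_, i) => i)
    .sort((a, b) => (text.slice(a) < text.slice(b) ? -1 : 1));
}

// Substring query in O(m log n) via binary search over sorted suffixes.
function suffixArrayContains(text, sa, pattern) {
  let lo = 0, hi = sa.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const prefix = text.slice(sa[mid], sa[mid] + pattern.length);
    if (prefix === pattern) return true;
    if (prefix < pattern) lo = mid + 1; else hi = mid - 1;
  }
  return false;
}
```

The query time depends on the pattern length, not the document length, which is the property being discussed above; the FM index keeps that while shrinking the index itself.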

It's certainly worth taking a look at Fuse (http://fusejs.io/), another JavaScript library that does fuzzy search.

This is neat, and reminds me a bit of the tricks used by RediSearch. I came across it on HN recently and have been geeking out on it since! http://redisearch.io

Lunr, named to correspond to Apache Solr, is also worth a look. https://lunrjs.com/

At the time of that post, the article covered a simpler example and only explained the trie structure. Wade didn't have the ranking system either. It has since been rewritten to include the methods used to rank the relevance of a term to a document, and to show how they are applied in a more complex example.

There must've been some updates, otherwise
