
Inside Wade, a 1kb search library - kbr
https://blog.kabir.ml/posts/inside-wade.html
======
vanderZwan
It just so happens that I've been working on trying to build the fastest
possible trie implementation in JavaScript to speed up `lz-string` and
potentially `js-search`. The whole process I went through is too much to write
down here, but you can basically read it here:

[https://github.com/pieroxy/lz-string/pull/98](https://github.com/pieroxy/lz-
string/pull/98)

[https://github.com/bvaughn/js-
search/issues/42](https://github.com/bvaughn/js-search/issues/42)

The short version is: replace those single-char strings with the result of
`charCodeAt()`, and then use `charCodeAt()` during search too. The reason is
that integer key lookup is much faster than string key lookup:

[https://stackoverflow.com/questions/10639488/faster-to-
acces...](https://stackoverflow.com/questions/10639488/faster-to-access-
numeric-property-by-string-or-integer/10639526#10639526)

It's important that _all_ of your keys are integers though, so the trick is to
actually store/look up `charCodeAt()+1`, and then store the indices at key 0.

If you feel like taking a risk, you can make a slightly more complicated data
structure where you replace the object with an array and do a linear search:

[https://jsfiddle.net/vanderZwan/ccfxv4sg/1/](https://jsfiddle.net/vanderZwan/ccfxv4sg/1/)

For most data this is the fastest, because the array will only be a handful of
characters (except at the root). However, in certain special cases
(specifically, if you have a lot of unique characters at the root) it becomes
prohibitively slow.

~~~
olivernn
Using `charCodeAt()` for numeric keys is an interesting optimisation, I'm
definitely going to try that out with Lunr.js.

I'm also interested in _why_ that is quicker though, aren't all object keys
converted to strings anyway?

~~~
vanderZwan
That is what the spec says, but one way to interpret that is to say that they
only have to _behave_ as if they were.

Hashing an integer or double can be much simpler than hashing a string, and
like the other comment said: for small enough keys you might even get away
with using an array.

edit: to give another example, the spec also says all number are doubles, and
yet all JS modern engines use optimisations in places where it can guarantee a
value is a 32-bit integer; in some cases adding `|0` (a forced truncation by
bitmask-or zero) as a compiler hint can speed up code.

edit2: this is also part of why you want your objects to be as type-stable as
possible (it also makes code less bug-prone that way because easier to reason
about). This includes the shape of _objects_. See these slides from 2011:

[https://www.slideshare.net/newmovie/know-yourengines-
velocit...](https://www.slideshare.net/newmovie/know-yourengines-
velocity2011/126-Typestable_JS_examples_Typestable_NOT)

------
gandreani
What's the memory consumption like? A node for each letter seems like it would
result in substantial memory consumption. I was thinking an object pool might
help here, but I'm not sure if its possible to use it given the nesting of the
nodes

Also, if the library removes stop words, does that mean its limited in which
languages it supports? I'm assuming only English stop words are filtered

Not being critical here, just genuinely curious and don't have time to look
through the code ATM

~~~
olivernn
Lunr.js [1] used to use a trie data structure and memory usage was a concern.
The problem is that although the data structure compresses prefixes, it does
nothing for suffixes. Depending on the corpus this can lead to a lot of
duplication.

Switching to a trie-like structure that compresses prefixes and suffixes can
lead to significant savings. Building the structure can be a bit more
burdensome, so there is a trade off there. There is a paper describing the
approach [2] and, if you're interested, my JavaScript implementation [3][4].

[1] [https://lunrjs.com](https://lunrjs.com)

[2]
[http://www.aclweb.org/anthology/J00-1002.pdf](http://www.aclweb.org/anthology/J00-1002.pdf)

[3]
[https://github.com/olivernn/lunr.js/blob/master/lib/token_se...](https://github.com/olivernn/lunr.js/blob/master/lib/token_set.js)

[4]
[https://github.com/olivernn/lunr.js/blob/master/lib/token_se...](https://github.com/olivernn/lunr.js/blob/master/lib/token_set_builder.js)

~~~
gandreani
Cheers! Thanks for sharing. EDIT: I like the site design too!

------
venning
> _In the end, we will be left with a results array like:_
    
    
      [1, 1]
    

Is that supposed to be [0, 1]? If not, how does that correlate to _[ "Hey",
"Hello"]_?

~~~
kbr
Actually, it is supposed to be [1, 1]. The actual value of the element
corresponds to the score it has, and the index of it refers to the index in
the data.

I'll add a note in the article.

~~~
venning
I'm sorry, I'm still a bit confused. If it "refers to the index in the data",
how is it referring to both "Hey" and "Hello"?

Running the following doesn't seem to work for me:

    
    
      var Wade = require('wade');
      var search = Wade(['Hey', 'Hello']);
      search('he'); // returns []

~~~
kbr
The article doesn't exactly simulate what Wade returns, but uses a simpler
version. I added a note about it now.

    
    
        [1, 1]
    

The item at index 0 is 1. This means that the item in index 0 of the data
("Hey") has a score of 1. The item at index 1 is 1, meaning that the item at
index 1 of the data ("Hello") has a score of 1.

Edit: Your example with Wade isn't working because it ignores the first item
in the query ("he") as it is a stop word. Searching for "h" will return both.

~~~
venning
Okay, I gotcha with the [1, 1].

I'm still curious about using it. The GitHub example [1] works as described,
but I can't get a result from the 'he' example. Is it saying no match was
found by returning an empty array?

EDIT: Okay, saw your edit. Might be nice to use an example that's not a
stopword. ;) Nice work on the library and the blog post.

[1]
[https://github.com/KingPixil/wade#usage](https://github.com/KingPixil/wade#usage)

~~~
kbr
Cool, glad you understood. The library returns an array of results, and will
return an empty array if there are none. Try searching for "h" or "hey" for an
example.

------
amelius
This is nice. How much work would it be to make this incremental, i.e. update
the data structure whenever new data is added or data is changed?

PS: Am I the only one who thinks it is quite strange that "search", being one
of the pillars of CS, has so few open-source libraries dedicated to it?

~~~
kbr
It wouldn't take much work to insert data actually, in fact it would be done
pretty fast. Updating something will take more time, as it needs to update the
whole structure.

I totally agree, search is an extremely interesting topic and I learned a lot
writing Wade. I'd definitely recommend people to learn more about how this
kind of stuff works.

------
jarym
Very interesting write-up - I use DefiantJS
[http://defiantjs.com](http://defiantjs.com) regularly and it seems to work
similarly but instead it pre-processes the data into XML and then uses (fast)
in-browser xpath to do searched

~~~
kbr
Thanks. DefiantJS uses a different way of searching because it needs to be
able to search JSON. Wade is focused on a simpler use case of searching within
an array of strings.

------
kevinsimper
Really nice explanation on how it works, I think a lot of people could benefit
by understanding how a search engine works!

~~~
kbr
Yeah, it's really interesting stuff! I explored a variety of different
algorithms when first writing Wade, including Knuth-Morris-Pratt and Boyer-
Moore. In the end I ended up using a trie as it has great performance at the
cost of a little preprocessing.

------
samayshamdasani
Hey Kabir! Congrats on making the frontpage :) I loved reading this post a
couple days back.

~~~
kbr
Thanks! Glad you liked it.

------
mzh0
it'll be really helpful if you explicitly list its supported language.

~~~
kbr
Well, for future reference, the library itself is for Javascript. The post is
more focused on how it works, rather than what language it uses.

~~~
mzh0
i was asking the language it supports, i.e. English, Spanish, Chinese, etc

~~~
kbr
My bad, it supports English, but the only feature dependent on this is the
part that removes stop words.

You can easily configure the stop words that Wade removes by editing
`Wade.config.stopWords`.

~~~
ash_gti
It will likely have trouble with languages that don't include spaces around
words. This will likely work for most latin based languages though.

