
A tiny static full-text search engine using Rust and WebAssembly (2019) - jaden
https://endler.dev/2019/tinysearch/
======
jil
I've been a fan of Matthias' project for a while. I learned about it soon
after starting Stork: [https://stork-search.net](https://stork-search.net)

They're very similar and share a lot of principles, though Matthias went full-
on towards the algorithmic aspect and I focused on the experience of including
the UI (copy-pastable from the code on the home page) and building a search
index.

I think WASM-aided in-browser search is really exciting! There are clear
benefits for content creators (embedding search on a Jamstack site has
historically been tricky) and for users alike (Caching & offline support is
pretty rad if your users are doing a lot of searching). I'm excited to see
Matthias' project get attention here!

------
jka
Does anyone else begin to feel like their role as a software developer is to
maintain a mental search index of available techniques, languages, libraries,
and metadata properties about each of them?

It's becoming so easy to compose software from available open source
components, and migrate functionality (like full-text search) to different
layers of the stack (and that's fantastic!).

It's just tricky to keep all the requirements and constraints (and
implications) in mind when selecting the appropriate libraries :)

~~~
amelius
It's sad that we still don't have automatic interoperability between
languages.

Someone should define a common API, and every language should adhere to it (or
risk not being taken seriously). This is not trivial, since some languages have
garbage collection, but it should be possible.

~~~
adwn
Like ImprobableTruth said, this isn't really possible without restricting the
expressivity of the interop API or the set of supported languages. At least
not on the function-call level.

A more flexible – though less efficient – approach would be a service-oriented
protocol. You'd send requests in the form of messages (binary or text) over a
byte-oriented bidirectional channel and receive the replies on the same
channel. Unfortunately this approach would require more code to set up than
primitive [1] function calls, and fine-grained interaction with the library
would be harder.

[1] "primitive" as in _lower-level_ , not as in _dumb_.
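
A minimal sketch of the message-passing idea, assuming length-prefixed JSON messages over any byte stream. The framing (a 4-byte big-endian length header), the `upper` method, and the `handle_request` adapter are all hypothetical and only illustrate the shape of such a protocol, not an existing one:

```python
import io
import json
import struct


def send_message(stream, message: dict) -> None:
    """Write one message: 4-byte big-endian length, then the JSON payload."""
    payload = json.dumps(message).encode("utf-8")
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)


def recv_message(stream) -> dict:
    """Read one length-prefixed message from the stream."""
    header = stream.read(4)
    if len(header) < 4:
        raise EOFError("channel closed before a full header arrived")
    (length,) = struct.unpack(">I", header)
    payload = stream.read(length)
    if len(payload) < length:
        raise EOFError("channel closed in the middle of a message")
    return json.loads(payload.decode("utf-8"))


def handle_request(request: dict) -> dict:
    """Hypothetical adapter: translate a message into a library call."""
    if request["method"] == "upper":
        return {"id": request["id"], "result": request["params"]["text"].upper()}
    return {"id": request["id"], "error": "unknown method"}


# An in-memory buffer stands in for the bidirectional channel.
channel = io.BytesIO()
send_message(channel, {"id": 1, "method": "upper", "params": {"text": "tinysearch"}})
channel.seek(0)
reply = handle_request(recv_message(channel))
print(reply)
```

The extra setup compared with a plain function call is visible even in this toy version: every call needs framing, serialization, and a dispatch step on the other side.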

~~~
adwn
Dammit, now I can't think about anything else but how to design such a
protocol, and how to generate adapters which translate between this protocol
and the API of a library...

Edit: From the perspective of the interop protocol, it wouldn't make much
difference if the library runs in the same address space or in a different
process. Large blobs of data, like a picture or a long string, could be
passed via pointers (in the same process) or via shared memory (in different
processes).

~~~
asdfman123
If you're trying to make an API for all programming languages, aren't you
essentially just recreating something like the Java virtual machine but with
your own biases and assumptions inserted?

~~~
adwn
You're misunderstanding my idea. Don't think "C ABI with higher-level types
and objects", think "HTTP with more structure".

~~~
asdfman123
But it seems that kind of protocol would just be a way of telling a computer
_what_ to do, not _how_ to do it. How would that be better than any other
messaging format that exists?

Genuinely curious, because I don't fully understand this myself but the idea
is interesting.

~~~
adwn
To be honest, I don't know. It was just a quick idea, and I'm increasingly
less sure whether it makes sense at all. Sorry to disappoint you :-(

------
krut-patel
I was looking for something similar (client side text search) and landed upon
MiniSearch[0]. While it doesn't support some of the advanced features of lunr
(like wildcard search), it was perfect for my needs. The accompanying blog
post[1] explains the trade-offs pretty well.

0 -
[https://github.com/lucaong/minisearch](https://github.com/lucaong/minisearch)

1 - [https://lucaongaro.eu/blog/2019/01/30/minisearch-client-side...](https://lucaongaro.eu/blog/2019/01/30/minisearch-client-side-fulltext-search-engine.html)

------
karterk
Really cool. Reading through your incremental discoveries (aka going down the
rabbit hole!) reminded me of my own adventure with building a typo tolerant
search engine (you can see it here:
[https://github.com/typesense/typesense](https://github.com/typesense/typesense)).

What began as a simple side project 4 years ago has consumed a significant
part of my free time over the last couple of years.

WebAssembly is certainly going to open a lot of new avenues for doing
interesting things in the browser.

------
_bxg1
Really neat project and fantastic write-up. I always enjoy following the
journeys of people who forge into the untamed lands of WASM.

I always find myself wishing I had a good excuse to use WASM for something,
but never being able to find one, so it's exciting to see that you did! The
fact is that JavaScript logic is rarely the bottleneck in web apps. And when
it is, it's usually tangled up in UI rendering code that would be hard to
tease out into WASM. You do bring up an interesting point, though, which I
hadn't considered: WASM isn't just faster, it's _smaller_. That alone could
make it useful in some cases where the speed may not be needed!

~~~
omn1
Thanks! On top of the size benefits, I love that I can finally use languages
other than JavaScript on the frontend. I couldn't have done it in JS because
I'd have to write a BloomFilter implementation in it (which I would not be
capable of) or bundle an existing library, which would have increased code
size (hence, defeating the point of the project). Portability is the other big
feature of wasm.
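
For readers who haven't met the data structure, here is a minimal Bloom filter sketch in Python. It is an illustration only, not tinysearch's implementation: the bit count, the number of hashes, the SHA-256-based hashing, and the one-filter-per-post layout are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array; membership tests may give false positives,
    but never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, word: str):
        # Derive num_hashes bit positions by salting the word with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, word: str) -> None:
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(word)
        )


# Hypothetical corpus: one filter per post (an assumption for this sketch).
posts = {
    "Tiny search": "a tiny static search engine written in rust",
    "Static sites": "building a static site with zola",
}
filters = {}
for title, body in posts.items():
    post_filter = BloomFilter()
    for word in body.lower().split():
        post_filter.add(word)
    filters[title] = post_filter


def search(query: str) -> list:
    """Return titles whose filter (probably) contains every query word."""
    words = query.lower().split()
    return [title for title, f in filters.items() if all(w in f for w in words)]


print(search("rust engine"))
```

Each filter here is a fixed 128 bytes regardless of how long the post is, and the original words are never stored.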

------
craig
Great post! Doesn't it make sense to load the index separately, instead of as a
single bundle? Right now, wouldn't the client bust its cache every time the
content changes?

~~~
omn1
This was requested before and there even was work on a prototype that has
since stalled. If you (or anyone else) is interested, please check out
[https://github.com/mre/tinysearch/pull/37](https://github.com/mre/tinysearch/pull/37).
Maybe we can get this done in a future version. :)

------
prayze
I've always been curious about this. What's the best practice for loading a
large JSON file for large sets of search results? I believe when working with
lunr in the past, I ended up making large network requests to load the entire
JSON file at once. What's the proper way to deal with this?

~~~
wereHamster
Once your website reaches a certain size, the JSON will be too big to load.
Then you'll have to offload the search request to a server. Either self-
hosted, or a service like Algolia.

~~~
pmlnr
Push the corpus into SQLite; it has built-in FTS engines[^1]. Then serve it
with anything. Unfortunately this needs server-side code, but only around 30
lines of PHP.

[^1]: [https://www.sqlite.org/fts5.html](https://www.sqlite.org/fts5.html)
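
A rough sketch of the FTS5 approach, written in Python rather than PHP and using an in-memory database. The table name, columns, and sample rows are made up for the example, and it assumes the local SQLite build ships with FTS5 enabled:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# An FTS5 virtual table indexes every column for full-text search.
conn.execute("CREATE VIRTUAL TABLE posts USING fts5(title, body)")
conn.executemany(
    "INSERT INTO posts (title, body) VALUES (?, ?)",
    [
        ("Tiny search", "A tiny static search engine using Rust and WebAssembly"),
        ("Static sites", "Comparing Zola and Jekyll for static sites"),
    ],
)


def search(query: str, limit: int = 10) -> list:
    """Return matching titles, best match first (FTS5's built-in rank)."""
    rows = conn.execute(
        "SELECT title FROM posts WHERE posts MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()
    return [row[0] for row in rows]


print(search("rust"))
```

The query string is passed to FTS5 as-is, so a real endpoint would need to handle input that isn't valid FTS5 query syntax.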

~~~
ComputerGuru
You can do SQLite in the browser but it’ll have to download the entire DB file
instead of only opening the pages it needs (because the naive JS port can’t
convert page requests to range requests).
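
To make the page-to-range mapping concrete, here is a small sketch of the arithmetic involved. SQLite numbers pages from 1 and every page has the same size, so a page maps to a fixed byte range; the helper name is hypothetical and the 4096-byte default page size is an assumption about the database in question:

```python
def page_to_range_header(page_number: int, page_size: int = 4096) -> str:
    """Return the HTTP Range header value covering one SQLite page.

    Pages are numbered from 1; HTTP byte ranges are inclusive on both ends.
    """
    if page_number < 1:
        raise ValueError("SQLite page numbers start at 1")
    start = (page_number - 1) * page_size
    end = start + page_size - 1
    return f"bytes={start}-{end}"


print(page_to_range_header(1))  # bytes=0-4095
print(page_to_range_header(3))  # bytes=8192-12287
```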

~~~
woadwarrior01
It should be possible to support loading only the required pages on the
browser with SQLite compiled to WASM along with a custom VFS implementation.
Here’s a project[1] which does something similar (selectively load the SQLite
DB on demand), albeit with the SQLite file in a torrent.

[1]:
[https://github.com/bittorrent/sqltorrent](https://github.com/bittorrent/sqltorrent)

------
nmstoker
So once this loads, it sounds like it could be made to work offline. That
might open some interesting possibilities.

~~~
whb07
Wasm can be cached!

------
bitskyx
How about putting this index and search logic into a CloudFlare worker?

[https://developers.cloudflare.com/workers/templates/pages/he...](https://developers.cloudflare.com/workers/templates/pages/hello_world_rust/#resources)

~~~
bitskyx
Then you can upload an index of up to 1MB and still have decent performance
[https://developers.cloudflare.com/workers/about/limits/#numb...](https://developers.cloudflare.com/workers/about/limits/#number-of-scripts)

~~~
omn1
That's a good idea. In my case, I wanted a static search that I could deploy
next to my content. Cloudflare workers would require a (free) account, but
most importantly they wouldn't work fully offline. For bigger indices, that
would be a great trade-off, though. If you like, you can try pushing
tinysearch to a worker using wasm-pack. It's all Rust in the end, so you'd
only need to add a `/search` route e.g. with hyper
([https://github.com/hyperium/hyper](https://github.com/hyperium/hyper)). If
you're willing to experiment with this, don't hesitate to open a PR/issue on
GitHub and we can add that feature.

~~~
tmzt
It would be interesting to see a hybrid approach:

* server side WASM such as cloudflare workers and kv to build and maintain the index

* streaming copy of the simplified index to be pulled in by a browser-side wasm

* queries that go beyond the simple index forwarded to the worker

One way of simplifying would be to limit search terms to a certain length, or
only expose the most popular results.

By sharing wasm code the format can be optimized and not require a
compatibility layer or serdes.

------
hfourm
Couldn't find the search on mobile :(

~~~
codazoda
I found it, but couldn't make it work. Pixel 3a running Android 10 and the
stock Chrome browser. Hitting enter on the search field did nothing and I
can't see any other submit button. Then again, 10% of the search field is also
missing.

~~~
codazoda
On second look, it's real-time. No need to submit. The results just blend into
the page so I thought it was broken.

~~~
omn1
Sorry to hear that. Not an expert, but if you have any ideas on how to improve
the UX I'd be thankful.

------
Luff
How does it compare with flexsearch? It claims to be the fastest, smallest,
prettiest search library in town.
[https://github.com/nextapps-de/flexsearch](https://github.com/nextapps-de/flexsearch)

~~~
PaywallBuster
this one is 100% client side, flexsearch is client-server.

I guess it's not gonna work out for bigger indexes, as the payload will be huge
and it pushes all the work to the client.

~~~
rraghur
Nope.. Was looking into flex search today.. is all client side

------
kragen
Today's thread on the other search engine:
[https://news.ycombinator.com/item?id=23473365](https://news.ycombinator.com/item?id=23473365)

------
steffan
Thanks for describing your process as well as the tool, jaden! I appreciate
your pursuit of efficiency in the download and implementation. This is
inspiring me to add Wasm to my Rust usage.

~~~
steffan
That is, thanks to Matthias. But thanks for the post, jaden

------
tuananh
this should be smaller than something like lunr.js?

[https://lunrjs.com/guides/getting_started.html](https://lunrjs.com/guides/getting_started.html)

~~~
Groxx
potentially _much_ smaller, since you don't need to bundle the full content of
all articles to be able to search them.

------
steventhedev
How does this compare to a full reverse index? I would expect a full index to
be much simpler to implement and would compress better.

Still very impressive work, and gives me a new reason to learn Rust.

~~~
omn1
Author here. Tried a full reverse index first but it's much bigger in size -
think around two orders of magnitude, if I remember correctly.
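
For context, this is roughly what a full reverse (inverted) index looks like: a map from each word to the posts containing it. It's a generic sketch with made-up sample data, not the version that was tried for tinysearch, and it says nothing about the actual sizes measured:

```python
from collections import defaultdict


def build_reverse_index(posts: dict) -> dict:
    """Map each lowercased word to the set of post titles containing it."""
    index = defaultdict(set)
    for title, body in posts.items():
        for word in body.lower().split():
            index[word].add(title)
    return dict(index)


def search(index: dict, query: str) -> set:
    """Return titles that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical corpus for illustration.
posts = {
    "Tiny search": "a tiny static search engine written in rust",
    "Static sites": "building a static site with zola",
}
index = build_reverse_index(posts)
print(search(index, "rust engine"))
```

Note that every distinct word ends up stored verbatim as a key, alongside the list of posts it appears in.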

------
unwoundmouse
I'm also curious, how does zola compare to jekyll?

~~~
guu
jekyll:

\+ plugin support

\+ large community with lots of themes/plugins

\- need to install ruby and dependencies

\- slow to build large sites

zola:

\+ easy install (precompiled binary)

\+ fast

\- smaller feature set and community

\- no plugins

------
blairanderson
this search does not work, but I enjoy your enthusiasm.

~~~
pryce
In terms of not performing what the user might expect from search behaviour,
an example I found was the following:

The word "elasticlunr" appears in the linked article, and the linked article
appears in search results for it, but searching for any partial string such as
"elastic", "elasticl", "elasticlu" or "elasticlun" will not find the linked
article. Perhaps this behaviour is intended by the author, but it may not be
what the various users of the site expect.

Oddly,

> elastic* and elasticl*

does find the linked article, but

> elasticlu* and elasticlun*

do not.

~~~
ricket
Also, the search index has not been updated in 8 months, so it doesn't include
several recent articles. That can be confusing, since those articles are
right next to the search box when you're at the homepage. I opened a GitHub
issue for him.

~~~
omn1
Thanks for the heads-up; will fix.

The reason is that I'm working on decoupling the search frontend from the
JSON search blobs. Want to make the frontend-part installable through npm as
well (and not just cargo as it is now). Didn't get around to adding the search
index generation to GitHub Actions yet due to limited time. Here's the
pipeline if you want to give me a hand and add the tinysearch build:
[https://github.com/mre/mre.github.io/blob/source/.github/wor...](https://github.com/mre/mre.github.io/blob/source/.github/workflows/ci.yml)

------
boromi
Interesting use of Zola. I've been thinking about trying gatsby.js, and perhaps
Zola as well now. Has anyone used either?

~~~
steffan
Just started using Zola recently. Early, but after comparing with several
other engines it seemed the best suited to my application. So far I'm happy
with it.

~~~
lwhsiao
I'm a big fan of Zola. When I need more features, I'd reach for Hugo before
Jekyll. But for most simple static sites, Zola is my favorite.

~~~
Keats
Which features are you missing the most?

------
npiit
Thanks Matthias! I learned a lot from your YouTube channel on Rust. One of my
favorite tech channels ever.

~~~
omn1
Awww. Thanks so much. I suffer from extreme impostor syndrome, which is one
reason why I didn't continue making shows. Hearing that people actually
learned something is heart-warming. If anyone is interested, the old episodes
are here: hello-rust.show

~~~
npiit
I truly mean it. Your channel is one of the best real tech channels I've ever
seen.

~~~
deathtrader666
Link to said channel please?

~~~
omn1
[https://www.youtube.com/hellorust](https://www.youtube.com/hellorust)

------
bepvte
Great stuff, but it doesn't seem to search titles

~~~
omn1
Yeah, that's a bug. XD I was ingesting the title into the bloomfilter without
making it lowercase like the rest. Then when searching, I lowercase the user
input and guess what... the title can't be matched. Whoops. ;) Will fix.
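
The bug is easy to reproduce in miniature. In this sketch a plain Python set stands in for the bloom filter, and the function names and sample post are made up; the point is only that indexing and querying have to go through the same normalization:

```python
def normalize(text: str) -> list:
    """Lowercase and split on whitespace; used for body text and queries."""
    return text.lower().split()


def build_index_buggy(title: str, body: str) -> set:
    # Bug as described: title words keep their original casing.
    return set(title.split()) | set(normalize(body))


def build_index_fixed(title: str, body: str) -> set:
    # Fix: the title goes through the same normalization as everything else.
    return set(normalize(title)) | set(normalize(body))


def matches(index: set, query: str) -> bool:
    """True if every (lowercased) query word is in the index."""
    return all(word in index for word in normalize(query))


title, body = "Tiny Search", "a static engine"
print(matches(build_index_buggy(title, body), "Tiny"))  # False
print(matches(build_index_fixed(title, body), "Tiny"))  # True
```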

