Hacker News new | past | comments | ask | show | jobs | submit login
A tiny static full-text search engine using Rust and WebAssembly (2019) (endler.dev)
365 points by jaden on June 10, 2020 | hide | past | favorite | 85 comments

I've been a fan of Matthias' project for a while. I learned about it soon after starting Stork: https://stork-search.net

They're very similar and share a lot of principles, though Matthias went full-on towards the algorithmic aspect and I focused on the experience of including the UI (copy-pastable from the code on the home page) and building a search index.

I think WASM-aided in-browser search is really exciting! There are clear benefits for content creators (embedding search on a Jamstack site has historically been tricky) and for users alike (Caching & offline support is pretty rad if your users are doing a lot of searching). I'm excited to see Matthias' project get attention here!

Does anyone else begin to feel like their role as a software developer is to maintain a mental search index of available techniques, languages, libraries, and metadata properties about each of them?

It's becoming so easy to compose software from available open source components, and migrate functionality (like full-text search) to different layers of the stack (and that's fantastic!).

It's just tricky to keep all the requirements and constraints (and implications) in mind when selecting the appropriate libraries :)

Don't try to keep them all in mind. You have to approach each problem/project holistically and include existing environment, maturity, etc in your evaluation process. Having too many uncurated choices can seem overwhelming, but it's a valuable skill to be able to recognize the difference between your needs/challenges and the ones of people toying around and/or solving very specialized problems. Admire the edge of the ecosystem with interest and intrigue, but apply practically.

> Does anyone else begin to feel like their role as a software developer is to maintain a mental search index of available techniques, languages, libraries, and metadata properties about each of them?

Yup. Every highly paid professional is highly paid for their knowledge and expertise. They need to be aware of all the relevant solutions, and the best way to apply them.

Top lawyers know about law, top doctors know about medicine, and they farm out the work to their less-highly paid nurses and paralegals.

When you reach the top of your field your main value is knowing what's out there and how to implement it, making sure more junior employees don't make mistakes along the way. (That is, if you stay on the IC track and don't go into management.)

Thanks; that's a straightforward and insightful way of thinking about it.

It makes me wonder whether some/all of this can be encoded into an actual search engine or constraint satisfier. Arguably it's the dream of the Expert System[1] all over again.

[1] - https://en.wikipedia.org/wiki/Expert_system

It's sad that we still don't have automatic interoperability between languages.

Someone should define a common API, and every language should adhere to it (or risk not be taken seriously). This is not trivial, since some languages have garbage collection, but it should be possible.

Like ImprobableTruth said, this isn't really possible without restricting the expressivity of the interop API or the set of supported languages. At least not on the function-call level.

A more flexible – though less efficient – approach would be a service-oriented protocol. You'd send requests in the form of messages (binary or text) over a byte-oriented bidirectional channel and receive the replies on the same channel. Unfortunately this approach would require more code to set up than primitive [1] function calls, and fine-grained interaction with the library would be harder.

[1] "primitive" as in lower-level, not as in dumb.

Dammit, now I can't think about anything else but how to design such a protocol, and how to generate adapters which translate between this protocol and the API of a library...

Edit: From the perspective of the interop protocol, it wouldn't make much difference if the library runs in the same address space or in a different process. Large blobs of data, like an picture or a long string, could be passed via pointers (in the same process) or via shared memory (in different processes).

Have a look at how Python programs can call C functions (you can load a dynamically linked library in Python):


You can store and manipulate C pointers within Python.

The other way around should also be possible: manipulate references to Python objects from within C.

I suppose this could be extended to Java objects, etc., and it could be based on a single API.

If you're trying to make an API for all programming languages, aren't you essentially just recreating something like the Java virtual machine but with your own biases and assumptions inserted?

You're misunderstanding my idea. Don't think "C ABI with higher-level types and objects", think "HTTP with more structure".

But it seems that kind of protocol would just be a way of telling a computer what to do, not how to do it. How would that be better than any other messaging format that exists?

Genuinely curious, because I don't fully understand this myself but the idea is interesting.

To be honest, I don't know. It was just a quick idea, and I'm increasingly less sure, whether it makes sense at all. Sorry to disappoint you :-(

Right, a way to broker requests between objects. Just a common architecture for that.

Exactly! But I don't know how to do that without either a) creating yet-another-standard which nobody supports, or b) requiring just as much effort as providing a C-based API for a given library.

You mean like protobuf or json or xml or...?

Garbage collectors ruin embeddability which is why people have been using C/C++ for all this time. The alternative is to serialize and deserialize like you are doing network communication.

I sadly think this is just not possible unless the common API is extremely restrictive, and for that case we already have C FFIs.

Even something relatively simple like generics just works so differently depending on the language. C++'s templates are just fundamentally incompatible with how Java generics work.

Forget about generics, even strings can't be agreed on. (Sometimes not even within a single language...)

There is a (relatively) common API: machine language.

Is that too architecture-specific? You could force all programming languages compile down to Java byte code, but that would be too restrictive. Not everyone wants to be bound by byte code.

You could create a language that aims to accomplish what you describe, but then you'd just have yet another programming language. (Insert relevant XKCD here.)

And humans either aren't smart enough to create a perfectly flexible language without warts that would work for people doing dramatically different things, or there are too many tradeoffs for those people to share the same den. Someone who has thought more about the philosophy of programming languages probably could answer that better than I could.

I'm not so sure about that. Just try to get CSS to display properly across browsers, or display content in an iFrame that resizes properly, on different devices. Edge case problems, galore.

I was looking for something similar (client side text search) and landed upon MiniSearch[0]. While it doesn't support some of the advanced features of lunr (like wildcard search), it was perfect for my needs. The accompanying blog post[1] explains the trade-offs pretty well.

0 - https://github.com/lucaong/minisearch

1 - https://lucaongaro.eu/blog/2019/01/30/minisearch-client-side...

Really cool. Reading through your incremental discoveries (aka going down the rabbit hole!) reminded me of my own adventure with building a typo tolerant search engine (you can see it here: https://github.com/typesense/typesense).

What began as a simple side project 4 years ago has consumed a significant part of my free time over the last couple of years.

Web assembly is certainly going to open a lot of new avenues for doing interesting things on the browser.

Really neat project and fantastic write-up. I always enjoy following the journeys of people who forge into the untamed lands of WASM.

I always find myself wishing I had a good excuse to use WASM for something, but never being able to find one, so it's exciting to see that you did! The fact is that JavaScript logic is rarely the bottleneck in web apps. And when it is, it's usually tangled up in UI rendering code that would be hard to tease out into WASM. You do bring up an interesting point, though, which I hadn't considered: WASM isn't just faster, it's smaller. That alone could make it useful in some cases where the speed may not be needed!

Thanks! On top of the size benefits, I love that I can finally use languages other than JavaScript on the frontend. I couldn't have done it in JS because I'd have to write a BloomFilter implementation in it (which I would not be capable of) or bundle an existing library, which would have increased code-size (hence, defeating the point of the project). Portability is the other big feature of wasm.

Great post! Doesn't it make sense to load the index separately, instead of a single bundle? RN the client would bust it's cache every time the content changes?

This was requested before and there even was work on a prototype that has since stalled. If you (or anyone else) is interested, please check out https://github.com/mre/tinysearch/pull/37. Maybe we can get this done in a future version. :)

I've always been curious about this. What's the best practice for loading a large JSON file for large sets of search results? I believe when working with lunr in the past, I ended up making large network requests to load the entire JSON file at once. What's the proper way to deal with this?

Once your website reaches a certain size, the JSON will be too big to load. Then you'll have to offload the search request to a server. Either self-hosted, or a service like Algolia.

You can probably push that "certain size" a long way down the road if you need to. (at least in this specific client side text search case, rather than a generic "how do I server large json files" way)

If you tweak the search box so it doesn't do anything until you've typed at least 2 or 3 letters, you could then serve regenerated son files that only contain that matches that have that prefix... No need for any of the json payload for words/phrases starting with aa..rt if someone's typed "ru" into the search box.

That means you'd have 676 distinct json files, but you'd only ever load the one you need...

But that requires a network request after typing to get results, which is about the same user experience of a search bar that requests some search API.

> requires a network request

Seems to me that often, though not always, this network request would happen whilst the user is typing — say, busy typing chars 3, 4 and 5. That the netw req won't be noticeable for the human

And, if typing more chars, or backspace-deleting to fix a typo ... no netw request required.

And the same scenario, if typing a 2nd word.

I'm guessing in like 90% of the cases, it'll seem as if there was never a netw req.

True, but it'd still allow your site to be a collection of static files instead of needing a executable running on a backend somewhere.

Push the corpus into SQLite, it has built-in FTS engines[^1]. Then serve it with anything. Unfortunately this needs server side code, but like 30 lines of PHP.

[^1]: https://www.sqlite.org/fts5.html

You can do SQLite in the browser but it’ll have to download the entire dB file instead of only opening the pages it needs (because the naive JS port can’t convert page requests to range requests).

It should be possible to support loading only the required pages on the browser with SQLite compiled to WASM along with a custom VFS implementation. Here’s a project[1] which does something similar (selectively load the SQLite DB on demand), albeit with the SQLite file in a torrent.

[1]: https://github.com/bittorrent/sqltorrent

Wow this is great, thank you for the tip! My site uses SQLite3 for storage so this is perfect.

I'm continually amazed at how featureful SQLite is.

Did anyone try to use IndexedDB instead of a big load of json with a separated json approach? I mean a bunch of static json files, that are only loaded if not already in indexeddb... like patches?


JSON may be a very bandwidth-inefficient format. A format that can be parsed from a stream could save RAM and bandwidth, especially on mobile which is most constrained.

However it compresses very well on the wire. One of the simplest "streamed-json" solutions is to have one json per line of input and parse that incrementally.

One thing you can do is make the data itself more concise:

- Stripping all whitespace from formatted JSON can make a huge difference

- Making property names shorter (these get repeated for every element in a large dataset!)

- If your data is relatively flat, you could replace an array of objects with an array of arrays

Or you could go all the way and serve data in CSV format, which is very space-efficient and has the neat property of being tolerant of being broken up into pieces. Though it may not parse as quickly, since JSON has native parsing support from the browser.

JSON can be compressed very well and while your advices are very good, once you compress it's not very important anymore.

Sometimes you can partition large sets into multiple ones. If you add this extra level of indirection, you don't need to retrieve all data, just the data that is being used, which should be small fraction for some problems that work well with it. But many problems can't be solved by this approach.

I guess you can always load only the index in the browser, isn't it?

So once this loads, it sounds like it could be made to work offline. That might open some interesting possibilities.

Wasm can be cached!

Couldn't find the search on mobile :(

I found it, but couldn't make it work. Pixel 3a running Android 10 and the stock Chrome browser. Hitting enter on the search field did nothing and I can't see any other submit button. Then again, 10% of the search field is also missing.

On second look, it's real-time. No need to submit. The results just blend into the page so I thought it was broken.

Sorry to hear that. Not an expert, but if you have any ideas on how to improve the UX I'd be thankful.

It is at the top right in a search icon. Right underneath his about me link.

How about putting this index and search logic into a CloudFlare worker?


Then you can upload index up to 1MB and still have decent performance https://developers.cloudflare.com/workers/about/limits/#numb...

That's a good idea. In my case, I wanted a static search that I could deploy next to my content. Cloudflare workers would require a (free) account, but most importantly they wouldn't work full offline. For bigger indices, that would be a great trade-off, though. If you like, you can try pushing tinysearch to a worker using wasm-pack. It's all Rust in the end, so you'd only need to add a `/search` route e.g. with hyper (https://github.com/hyperium/hyper). If you're willing to experiment with this, don't hesitate to open a pr/issue on Github and we can add that feature.

It would be interesting to see a hybrid approach:

* server side WASM such as cloudflare workers and kv to build and maintain the index

* streaming copy of the simplified index to be pulled in by a browser-side wasm

* queries that go beyond the simple index forwarded to the worker

One way of simplifying would be to limit search terms to a certain length, or only expose the most popular results.

By sharing wasm code the format can be optimized and not require a compatibility layer or serdes.

How does it compare with flexsearch? It claims to be the fastest, smallest, prettiest search library in town. https://github.com/nextapps-de/flexsearch

this one is 100% client side, flexsearch is client-server.

I guess for bigger indexes not gonna work out, as the payload will be huge and it pushes all the work to the client.

Nope.. Was looking into flex search today.. is all client side

Today's thread on the other search engine: https://news.ycombinator.com/item?id=23473365

Thanks for describing your process as well as the tool, jaden! I appreciate your pursuit of efficiency in the download and implementation. This is inspiring me to add Wasm to my Rust usage.

That is, thanks to Matthias. But thanks for the post, jaden

this should be smaller than sth like lunr.js?


potentially much smaller, since you don't need to bundle the full content of all articles to be able to search them.

How does this compare to a full reverse index? I would expect a full index to be much simpler to implement and would compress better.

Still very impressive work, and gives me a new reason to learn Rust.

Author here. Tried a full reverse index first but it's much bigger in size - think around two orders of magnitude, if I remember correctly.

I'm also curious, how does zola compare to jekyll?


+ plugin support

+ large community with lots of themes/plugins

- need to install ruby and dependencies

- slow to build large sites


+ easy install (precompiled binary)

+ fast

- smaller feature set and community

- no plugins

this search does not work, but I enjoy your enthusiasm.

In terms of not performing what the user might expect from search behaviour, an example I found was the following:

A word "elasticlunr", appears in the linked article, and the linked article appears in search results, but searching any partial string such as "elastic", "elasticl" "elasticlu" and "elasticlun" will not result in finding the linked article. Perhaps this behaviour is intended by the author, but it may not be intended by the various users of the site.


> elastic* and elasticl*

does find the linked article, but

> elasticlu* and elasticlun*

do not.

The article mentions several times that it only matches on whole words

Also the search index has not been updated in 8 months so it doesn't include the several recent articles. Which can be confusing, since those articles are right next to the search box when you're at the homepage. I opened a github issue for him.

Thanks for the heads-up; will fix.

The reason is, that I'm working on decoupling the search frontend from the JSON search blobs. Want to make the frontend-part installable through npm as well (and not just cargo as it is now). Didn't get around to adding the search index generation to Github actions yet due to limited time. Here's the pipeline if you want to give me a hand and add the tinysearch build: https://github.com/mre/mre.github.io/blob/source/.github/wor...

Yeah, the search [rust shit] finds three articles which entirely fail to mention "shit".

Interesting use of zola, been thinking about trying gatsby.js perhaps Zola as well now. Has anyone used either?

Author here. I love and use both. For static sites, Zola is super nice: it's fast, has a great community, no thrills, and it gets better all the time. The config options are sensible and the templating engine is easy to learn. Sites I'm maintaining with Zola: https://endler.dev/, https://hello-rust.show/

For dynamic content, I love Gatsby. It can read from whatever you throw at it: Files (Markdown, plaintext, YAML), APIs (e.g. Github),... I'm not a frontend dev so my React knowledge is quite limited, but it's enough for simple components. Sites I'm maintaining with Gatsby: https://analysis-tools.dev/

Creator of Zola so a bit biased. Zola actually comes with an automatic search engine built-in that you can see on https://www.getzola.org/

I haven't used Gatsby.js as doing React/GraphQL for my static site isn't really something I'm interested in. If your site is very dynamic/pulls from various remote sources or just want to use React it's better than Zola.

Just started using Zola recently. Early, but after comparing with several other engines it seemed the best suited to my application. So far I'm happy with it.

I'm a big fan of Zola. When I need more features, I'd reach for Hugo before Jekyll. But for most simple static sites, Zola is my favorite.

Which features are you missing the most?

Great stuff, but it doesn't seem to search titles

Yeah, that's a bug. XD I was ingesting the title into the bloomfilter without making it lowercase like the rest. Then when searching, I lowercase the user input and guess what... the title can't be matched. Whoops. ;) Will fix.

Thanks Matthias! I learned a lot from your YouTube channel on Rust. One of my favorite tech channels ever.

Awww. Thanks so much. I suffer from extreme impostor syndrome, which is one reason why I didn't continue making shows. Hearing that people actually learned something is heart-warming. If anyone is interested, the old episodes are here: hello-rust.show

I truly mean it. Your channel is one of the best real tech channels I've ever seen.

Link to said channel please?

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact