
Writing a full-text search engine using Bloom filters (2013) - memexy
https://www.stavros.io/posts/bloom-filter-search-engine/
======
praveen9920
IMDb had (not sure about the current situation) an elegant but not especially
sophisticated solution for movie search on their website.

They generated a lot of JSON files with all the permutations of alphanumerics
up to three characters (like a.json, b.json, aa.json, sta.json). When the user
types the initial characters, they fetch the corresponding JSON, which has the
most relevant results.

They do "actual" search only after three characters but most of the time you
would find the result in first three characters. This is not a full-text
search but has its own advantages.

Bloom filters seem to have similar advantages.
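
For illustration, a minimal sketch of that prefix-file generation in Python
(the titles, paths, and cutoffs here are made up, not IMDb's actual scheme):

    import json
    import os
    from collections import defaultdict

    def build_prefix_files(titles, out_dir="prefixes", max_len=3, top_n=10):
        """titles: already sorted by relevance/popularity."""
        buckets = defaultdict(list)
        for title in titles:
            key = title.lower()
            for n in range(1, max_len + 1):
                prefix = key[:n]
                if len(prefix) == n and len(buckets[prefix]) < top_n:
                    buckets[prefix].append(title)
        os.makedirs(out_dir, exist_ok=True)
        for prefix, matches in buckets.items():
            # One static file per prefix: s.json, st.json, sta.json, ...
            with open(os.path.join(out_dir, prefix + ".json"), "w") as f:
                json.dump(matches, f)

    # The client fetches sta.json as soon as the user has typed "sta".
    build_prefix_files(["Star Wars", "Star Trek", "Stalker", "Casablanca"])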

~~~
sudhirj
For those who want to do something like this, the word/library you'll need to
search for is "trigrams". Make trigrams of all your relevant search terms,
then run a job that publishes a JSON file to your S3/CDN for each trigram,
with the most relevant documents sorted inside each one. You have a
theoretical max of 26x26x26 = 17,576 files, but many possible trigrams don't
occur in English.

You can keep modifying the lists over time to take recency, new data, etc.
into account. But this is a pretty scalable solution for the most part.
Rebuilding your entire index on S3 should still cost you less than a dollar.
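
A minimal sketch of that indexing job in Python (the document set, ranking,
and paths are illustrative; the actual S3/CDN upload is left out):

    import json
    import os
    from collections import defaultdict

    def trigrams(text):
        """The distinct 3-character substrings of a normalized string."""
        t = text.lower()
        return {t[i:i + 3] for i in range(len(t) - 2)}

    def build_trigram_shards(documents, out_dir="shards"):
        """documents: {doc_id: text}, iterated in relevance order."""
        shards = defaultdict(list)
        for doc_id, text in documents.items():
            for gram in trigrams(text):
                if gram.isalnum():  # skip grams spanning spaces etc.
                    shards[gram].append(doc_id)
        os.makedirs(out_dir, exist_ok=True)
        for gram, doc_ids in shards.items():
            # One static file per trigram, e.g. shards/sta.json
            with open(os.path.join(out_dir, gram + ".json"), "w") as f:
                json.dump(doc_ids, f)

    build_trigram_shards({1: "star wars", 2: "star trek", 3: "casablanca"})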

------
cetra3
I feel like using a straight Bloom filter will get you some false positives
and may not be the right fit, but it is an interesting take and might be good
for a first pass.
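
For reference, a minimal sketch of the per-document Bloom filter approach in
Python (the sizes and hashing here are made up, not the article's exact
parameters), false positives included:

    import hashlib

    M, K = 2048, 3  # filter size in bits, number of hash functions

    def bit_positions(word):
        for k in range(K):
            digest = hashlib.sha1(f"{k}:{word}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % M

    def make_filter(text):
        bits = 0
        for word in set(text.lower().split()):
            for pos in bit_positions(word):
                bits |= 1 << pos
        return bits

    def might_contain(bits, query):
        # True for every document that really matches, and
        # occasionally for one that doesn't: the false positives.
        return all(bits >> pos & 1
                   for word in query.lower().split()
                   for pos in bit_positions(word))

    filters = {name: make_filter(text) for name, text in
               {"post1": "bloom filters are neat",
                "post2": "client side search"}.items()}
    print([name for name, bits in filters.items()
           if might_contain(bits, "bloom filters")])  # ['post1']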

mdbook ([https://github.com/rust-lang/mdBook](https://github.com/rust-lang/mdBook))
uses elasticlunr ([http://elasticlunr.com/](http://elasticlunr.com/)) as an
inverted index and is generated statically.

The same is done for zola
([https://www.getzola.org/documentation/content/search/](https://www.getzola.org/documentation/content/search/)).

~~~
inertiatic
Lunr and elasticlunr are awesome projects. I was part of a search team, and we
were banging our heads against the wall trying to figure out a way to serve a
new feature request (basically some advanced autocomplete functionality) in a
way that fit in with what we were already maintaining.

It took me literally half an hour to implement this using elasticlunr in the
browser. We still went with the backend solution, as it felt more extensible,
but it got me thinking that there's some abuse of ES for things you could
easily do that way.

~~~
acidbaseextract
What's your feeling on the status of elasticlunr as a project? I've been
eyeing the functionality for a project, but it hasn't received updates in
quite a few years.

~~~
lukevp
I have yet to see development of a JS search engine continue over the long
term. I used FlexSearch in a recent project, but it too appears to have been
abandoned. MiniSearch seems nice, and it's a bit newer; it has a focus on low
memory usage, so it may also work well for this scenario of a static index.

------
6510
I took letter combinations aa ab ac ... zx zy zz z0 z1 z2 ... and created a
file for each letter combination: aa.dat ... zz.dat ...

(26 letters + 10 digits + space) * (26 letters + 10 digits + space/wildcard) = 37 * 37 = 1369

Peanuts for Apache.

Each of my 10,000-ish articles got a single bit in those files, where the bit
offset is the article number.

A search query like "foo bar" would involve the files " f", "fo", "oo", "o ",
" b", "ba", "ar", "r ".

These are only 8 files, for a whopping 8 kB-ish of data!

We then perform the AND operation: 00000000 01000000 00000000

Remaining articles are pulled [in reverse order] and checked for the presence
of the exact phrase or the presence of keywords.
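
In Python the lookup would look something like this (the bit order within a
byte is assumed MSB-first, and the space character is assumed to appear in
the file names as-is; my actual code is PHP):

    def bigrams(query):
        padded = " " + query.lower() + " "  # pad so word boundaries match
        return {padded[i:i + 2] for i in range(len(padded) - 1)}

    def candidate_articles(query, data_dir="."):
        result = None
        for gram in bigrams(query):
            with open(f"{data_dir}/{gram}.dat", "rb") as f:
                bitmap = f.read()
            # AND this bigram's bitmap into the running result
            result = bitmap if result is None else \
                bytes(a & b for a, b in zip(result, bitmap))
        # Expand the surviving bits back into article numbers
        return [i * 8 + j
                for i, byte in enumerate(result or b"")
                for j in range(8)
                if byte & (0x80 >> j)]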

The PHP then flushes

    '<br><a href="'.$articleNum.'.html">'.$title.'</a>'

to the page.

(I also did a version with a file with URLs in it. Then the first URL is
article 1, or bit 1. Fun for stuff linked to from the blog.)

For longer articles one could make three-character files. 37x37x37 is a
little over 50,000 files. Each query only involves a few of those.

Sadly one can't push individual bits onto files, so for a new article a bit
eventually has to be appended to each data file. I just add the new articles
as false positives until I have a bunch of them.

The advantage of the disadvantage that each page has to be checked for the
keywords is that if a page is gone it won't show up in the results. Even a
highly accurate search solution will have false positives.

[http://blog.go-here.nl/search-blog.php?q=aliens](http://blog.go-here.nl/search-blog.php?q=aliens)

view source for laughs

~~~
compressedgas
You have 2008 script tags on that page. I know why, but you could at least
have had them remove themselves from the DOM after they were done adding
their content.

> Sadly one can't push individual bits onto files, so for a new article a bit
> eventually has to be appended to each data file.

You can; it's just that each of the files has to be padded out to a multiple
of full bytes. Just read the last byte and OR in the new bit at its offset.
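
Something like this sketch, say (Python; the MSB-first bit order matches the
search description above, and the path is hypothetical):

    import os

    def set_article_bit(path, article_num):
        byte_pos, bit = divmod(article_num, 8)
        with open(path, "r+b") as f:
            size = f.seek(0, os.SEEK_END)
            if byte_pos >= size:
                # pad the file out to whole bytes up to byte_pos
                f.write(b"\x00" * (byte_pos - size + 1))
            f.seek(byte_pos)
            existing = f.read(1)[0]
            f.seek(byte_pos)
            f.write(bytes([existing | (0x80 >> bit)]))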

The method you are using is called q-gram filtration.

The main optimization you could do is to use the specificities of the grams
to prioritize the ANDing, so that the most specific grams are processed
first. This allows early stopping for misses.

You can also use multiple threads to read and AND in parallel, after an
AND-tree order has been decided from the specificities.
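
In Python, the ordering-plus-early-stopping part might look like this (the
gram bitmaps are assumed to be already loaded as bytes):

    def and_most_specific_first(bitmaps):
        # Fewest set bits = most specific gram = fastest pruning.
        ordered = sorted(bitmaps,
                         key=lambda bm: sum(bin(b).count("1") for b in bm))
        result = ordered[0]
        for bitmap in ordered[1:]:
            result = bytes(a & b for a, b in zip(result, bitmap))
            if not any(result):  # nothing survives: stop early, it's a miss
                break
        return result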

I would also remark that it is possible to implement approximate matching, or
error-tolerant search, with q-grams: that is, to rank results according to
edit distance (Levenshtein distance) from the query.

It is also possible to do regular expression search over them, as Russ Cox
explains in "Regular Expression Matching with a Trigram Index, or How Google
Code Search Worked":
[https://swtch.com/~rsc/regexp/regexp4.html](https://swtch.com/~rsc/regexp/regexp4.html)
This is not specific to 3-grams; it can be made to work with any gram size,
or even with indexes consisting of variable-size grams.

~~~
6510
> You have 2008 script tags on that page. I know why, but you could at least
> have had them remove themselves from the DOM after they were done adding
> their content.

Unless I've missed something, I don't think it works like that. It is just an
HTML document; it comes as is. If you modify the DOM, the source stays the
same?

Also, the HTML parser does a wonderful job of ignoring things like comments
and display:none nodes / script nodes; it just doesn't do anything with them.
After the innerHTML is modified there is nothing further to do with each
script. It is just ignored.

I've tried things like adding a 5 MB comment to the end of a page to see the
behavior.

------
bawolff
Reminds me a lot of some of the research into distributed full-text search in
p2p networks (e.g.
[http://rboutaba.cs.uwaterloo.ca/Papers/Journals/2009/Reaz09.pdf](http://rboutaba.cs.uwaterloo.ca/Papers/Journals/2009/Reaz09.pdf)
)

Probably some of the lessons from that domain could be applied here.

~~~
memexy
The abstract mentions Bloom filters directly, so your analogy makes sense.

> Efficient discovery of information, based on partial knowledge, is a
> challenging problem faced by many large scale distributed systems. This
> paper presents Plexus, a peer-to-peer search protocol that provides an
> efficient mechanism for advertising a bitsequence (pattern), and discovering
> it using any subset of its 1-bits. A pattern (e.g., Bloom filter) summarizes
> the properties (e.g., keywords, service description) associated with a
> shared object (e.g., document, service).

But I think that paper is more about discovery than search. It would be good
for a tagging system, but I didn't read enough to know whether it would be
useful for a more general search index.

~~~
bawolff
Search, in essence, is just trying to find the documents that have all the
right keywords.

Well, also trying to rank them, but that's kind of harder in this context.

------
eddieoz
I'm building a decentralised blog, blockchain + IPFS based.

As I'm using markdown, like the example above, I can probably apply this
method to update the search index when a post is published, and then enable
client-side search in the platform.

I'll try it using this approach.

[https://github.com/eddieoz/549-personal-dx-blog](https://github.com/eddieoz/549-personal-dx-blog)
dex blog: eddieoz.com (or eddieoz.crypto using the Unstoppable extension)

~~~
StavrosK
Elasticlunr sounds like a more mature solution, maybe you'd like to try that.

------
nurettin
How is this (2013)? The content has comments about the 2018 Node ecosystem.

~~~
StavrosK
It came to me in a dream long ago.

------
byteface
Thanks for sharing. I'd heard of Bloom filters but hadn't really looked into
them. I also experimented years ago with client-side search on the GPU; it
seems very doable. I didn't use any data structures, but I considered it. I
started a Boyer-Moore-like shader here:
[https://github.com/byteface/GTP](https://github.com/byteface/GTP)

It works really well, even 9 years later. I doubt anything compares.

------
kragen
I like the idea of just downloading a summary file of the whole website and
searching it on the client side. It seems like it can only scale up to some
limited extent — it should be feasible for 1000 or 10000 pages, but maybe once
you have 100 000 pages it will start to be a bit top-heavy.

In 2006 I built the regular kind of posting-list inverted index for my web
pages at
[http://canonical.org/~kragen/inverted-index.html](http://canonical.org/~kragen/inverted-index.html).
I just used
regular HTML <table>s for the representation, trusting to Content-Encoding:
gzip to squeeze out the resulting redundancy, and the index is split into many
different files so that the client can snarf just the part of the index that
interests it. Building the index (over my smallish set of pages) was
surprisingly easy, although I don't think I ever implemented the actual
client-side JS search engine.

The same year, I also wrote a search engine using a kind of structure similar
to a Bloom filter per post, similar to the one Stavros describes, but
transposed and encoded for better locality of reference; source code is in
[https://www.mail-archive.com/kragen-hacks@canonical.org/msg00134.html](https://www.mail-archive.com/kragen-hacks@canonical.org/msg00134.html)
and the algorithm is outlined in
[https://www.mail-archive.com/kragen-tol@canonical.org/msg00155.html](https://www.mail-archive.com/kragen-tol@canonical.org/msg00155.html).

Three weeks ago, I wrote
[https://gitlab.com/kragen/derctuo/-/blob/master/recursive-bloom-filters.md](https://gitlab.com/kragen/derctuo/-/blob/master/recursive-bloom-filters.md)
describing a data structure, not yet tested, for using a bunch of
Bloom filters for indexing large data sets. This structure seems like it might
be capable of competing with traditional database sorted-column indexes for
ad-hoc query evaluation, perhaps even exceeding their performance on some
queries, though at a larger space cost. I hadn't thought about combining that
with my 2006 client-side inverted-index search approach, but it's designed to
have reasonable locality of reference, so it might work.

Related comment thread:
[https://news.ycombinator.com/item?id=23473303](https://news.ycombinator.com/item?id=23473303)

— ⁂ —

Bloom filters are a reasonable thing to try when looking for the "frequency-
domain representation of thinking", but I've long thought Pentti Kanerva's
Fully Distributed Representation was interesting, in which each atom is
encoded by a large random bitvector, attribute-value pairs are encoded by XORs
of the atomic bitvectors, and aggregates of multiple attributes are encoded as
the bitwise majority rule (with ties broken at random) of all their attribute-
value pairs.

You can recover the value of an attribute from an aggregate by XORing the
aggregate with the attribute's atom and examining its Hamming distance from
all the candidate atoms, a process which Kanerva originally suggested you
could train a neural network to be able to do in better than linear time. The
atoms actually stored for that attribute will have a Hamming distance many,
many standard deviations from 50%, while the false matches will all be within,
say, six standard deviations.
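
A toy sketch of the scheme (the dimension and the example record here are
made up, and a real system needs something much smarter than this linear scan
over the candidate atoms):

    import random

    N = 10_000  # bits per vector

    def atom():
        return random.getrandbits(N)

    def majority(vectors):
        """Bitwise majority rule, ties broken at random."""
        out = 0
        for bit in range(N):
            ones = sum(v >> bit & 1 for v in vectors)
            if 2 * ones > len(vectors) or (2 * ones == len(vectors)
                                           and random.random() < 0.5):
                out |= 1 << bit
        return out

    def hamming(a, b):
        return bin(a ^ b).count("1")

    # Encode the record {color: red, shape: round} as one vector.
    color, red, shape, round_ = atom(), atom(), atom(), atom()
    record = majority([color ^ red, shape ^ round_])

    # Recover the value of `color`: XOR with the attribute, then find
    # the nearest atom by Hamming distance.
    probe = record ^ color
    print(hamming(probe, red), hamming(probe, round_))  # ~2500 vs ~5000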

~~~
memexy
Sounds interesting. Any links and references for "Fully Distributed
Representation"? This is what I found when I Googled the quoted phrase:
[http://www.cap-lore.com/RWC97-kanerva.pdf](http://www.cap-lore.com/RWC97-kanerva.pdf).

Skimming the paper, this reminds me of word2vec and similar techniques for
embedding words into vector spaces, but instead of using XOR the embedding is
accomplished with another neural network. Here's a link to a good description
of the embedding process:
[https://towardsdatascience.com/neural-network-embeddings-explained-4d028e6f0526](https://towardsdatascience.com/neural-network-embeddings-explained-4d028e6f0526).

~~~
kragen
Yup, that's the paper — I probably read it at the time from Norm's site, QEPD.
More recently Pentti's been working on sparse distributed representations,
which as I understand it greatly simplify the matching problem.

I think the latent-spaces approach of ANNs is pretty interesting but I don't
feel like I understand it well enough. It's a bit lower dimensional than
Kanerva's proposal, though.

------
umvi
Doesn't seem like it would be that hard... Couldn't the server just run grep
when a user searches? grep could scan every static file in milliseconds.

~~~
stevekemp
(If you go down this route you might prefer ag/The Silver Searcher/similar,
as being faster than grep.)

I've actually reported a couple of security bugs in servers in the past that
were caused by people running grep for searches.

Imagine you have a tree of web content at /var/www; when a user searches for
"foo" you run "grep -r -l foo /var/www", then you parse out the resulting
filenames.

Now imagine what happens if your user searches for "foo /tmp; rm -rf /".
Whenever you pass user input to a command-line application you need to be
damn sure you know what you're doing; if the shell is involved in parsing
your command you'll have problems.
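
Concretely, the fix is to keep the shell out of it entirely and pass the
query as a single argv element (a Python sketch; the paths are illustrative):

    import subprocess

    query = 'foo /tmp; rm -rf /'  # hostile input, now harmless

    # Dangerous: shell=True lets the shell parse the interpolated query.
    # subprocess.run("grep -r -l " + query + " /var/www", shell=True)

    # Safer: no shell, one argv element per argument. The "--" stops
    # grep from treating a query like "-r" as an option.
    result = subprocess.run(
        ["grep", "-r", "-l", "--", query, "/var/www"],
        capture_output=True, text=True)
    matching_files = result.stdout.splitlines()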

~~~
kragen
This is a good point. Even if you don't have a shell injection vulnerability,
you may end up with a path traversal vulnerability. In this particular case I
think you're probably pretty safe as long as you're careful to only pass a
single argument to the grep process — but it depends on how exactly grep
parses its arguments. If there's a way to sneak ../ in there, you might get
hurt.

Bleh. I can see why people install WAFs.

~~~
StavrosK
I've lately become allergic to things that I can only partially reason about.
Sure, I'm probably going to be fine, but I can't _guarantee_ I'll be fine, and
the blackhats are always more knowledgeable than me about obscure bits of
shell/Unix behavior, so I just shy away and use a less free-form approach.

------
ga6840
Microsoft had a SIGIR best paper on using Bloom filters for search.

