
Writing a full-text search engine using Bloom filters - StavrosK
http://www.stavros.io/posts/bloom-filter-search-engine/?print
======
zxcvgm
I haven't had the chance to try it yet, but acrylamid, a static blog
generator written in Python, ships with its own JS search [0] built on
compressed suffix trees. It was probably inspired by Sphinx, the Python
documentation generator, which has a similar feature [1].

For static blogs, I think this could serve as a first level of search
instead of going straight to Google.

[0] [http://posativ.org/acrylamid/static-search.html](http://posativ.org/acrylamid/static-search.html)

[1] [http://stackoverflow.com/questions/605888/whats-the-search-engine-used-in-the-new-python-documentation](http://stackoverflow.com/questions/605888/whats-the-search-engine-used-in-the-new-python-documentation)

------
stormbrew
A related alternative that would eliminate false positives is an N-ary tree,
where the Bloom filter in this solution is just the first level. You could
then download a JSON document for the child of the matching top-level element
to narrow the match down to specific documents. You could also calculate
relevance this way.
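
One possible reading of this, as a rough sketch only: the Bloom filter answers
"maybe", and a small exact index, fetched only on a hit, confirms or rejects
it. The names `build`, `lookup` and `shard_of`, the filter size and the shard
count are all hypothetical, and in-memory JSON strings stand in for the
documents a client would download.

```python
import hashlib
import json

M = 256  # bits in the top-level filter (hypothetical size)
K = 3    # bit positions per word

def positions(word):
    """K bit positions for a word, all derived from one SHA-256 digest."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % M for i in range(K)]

def shard_of(word, shards):
    """Which second-level JSON document holds this word's exact postings."""
    return positions(word)[0] % shards

def build(docs, shards=4):
    bloom = 0
    files = [dict() for _ in range(shards)]
    for name, text in docs.items():
        for word in set(text.lower().split()):
            for p in positions(word):
                bloom |= 1 << p
            files[shard_of(word, shards)].setdefault(word, []).append(name)
    # Serialise each shard as it would be when served as a static file.
    return bloom, [json.dumps(f, sort_keys=True) for f in files]

def lookup(word, bloom, shard_json):
    word = word.lower()
    if any(not (bloom >> p) & 1 for p in positions(word)):
        return []  # definitely absent, so nothing needs to be downloaded
    shard = json.loads(shard_json[shard_of(word, len(shard_json))])
    return sorted(shard.get(word, []))  # exact postings: no false positives

docs = {
    "a.html": "bloom filters are compact",
    "b.html": "suffix trees are exact",
}
bloom, shard_json = build(docs)
print(lookup("bloom", bloom, shard_json))  # ['a.html']
```

The second level is where relevance data (term counts, positions) could also
live, since it is only fetched for words that probably exist.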

~~~
ris
I wonder why not build a tree of Bloom filters, with the bit array at each
level being the bitwise OR of its child nodes. Hashing only needs to be done
once, and a comparison happens at each level of the tree, which can be
traversed to find matches easily while discarding branches that are certain
to be fruitless.

A bit of googling shows me this already exists and is called a Bloofi tree.
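
A minimal sketch of that idea, not taken from any existing Bloofi
implementation: every node uses the same filter size and hash positions, an
inner node's bits are the OR of its children, and the query mask is hashed
once and compared at each level. `M`, `K`, `Node`, `leaf`, `build` and
`search` are names assumed for illustration.

```python
import hashlib

M = 512  # bits per filter; every node shares the size and hash positions
K = 3    # bit positions per word

def mask(word):
    """Bitmask with K bits set for a word; computed once per query."""
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    bits = 0
    for i in range(K):
        bits |= 1 << (int.from_bytes(digest[4 * i:4 * i + 4], "big") % M)
    return bits

class Node:
    def __init__(self, bits, children=(), doc=None):
        self.bits = bits          # leaf: the document's filter; inner: OR of children
        self.children = children
        self.doc = doc            # document name, set on leaves only

def leaf(name, text):
    bits = 0
    for word in text.split():
        bits |= mask(word)
    return Node(bits, doc=name)

def build(nodes, fanout=2):
    """Group nodes bottom-up until one root remains (expects at least one node)."""
    while len(nodes) > 1:
        parents = []
        for i in range(0, len(nodes), fanout):
            group = tuple(nodes[i:i + fanout])
            bits = 0
            for child in group:
                bits |= child.bits
            parents.append(Node(bits, children=group))
        nodes = parents
    return nodes[0]

def search(node, query_bits):
    """Return candidate documents, skipping subtrees that lack a query bit."""
    if node.bits & query_bits != query_bits:
        return []  # some bit is missing, so nothing below can match
    if node.doc is not None:
        return [node.doc]
    found = []
    for child in node.children:
        found.extend(search(child, query_bits))
    return found

docs = {
    "a.html": "bloom filters are compact",
    "b.html": "suffix trees are exact",
    "c.html": "static blogs need search",
}
leaves = [leaf(name, text) for name, text in docs.items()]
root = build(leaves)
# The result always includes a.html; false positives are still possible.
print(search(root, mask("bloom")))
```

The tree prunes work but does not remove false positives on its own: a leaf
can still match a word it never contained.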

~~~
yxhuvud
Then you are better at googling than I am. Could I ask for a link?

Edit: never mind, it appeared when I searched for "bloom trees" instead of
"bloofi".

------
code_duck
This is probably an improvement over the simplest idea I can think of:

    <?php exec("grep $terms *html", $results);

~~~
al2o3cr
The motto of PHP devs is apparently "F__KIN ESCAPING, HOW DOES IT WORK?".
Use escapeshellarg, kids.
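
For comparison, here is a sketch of the same precaution in Python rather than
PHP (a language swap made purely for illustration): either pass the terms as
their own argument so no shell ever parses them, or quote them with
`shlex.quote` when a shell string is unavoidable. `grep_command` and
`grep_shell_string` are hypothetical helper names.

```python
import glob
import shlex

def grep_command(terms, pattern="*html"):
    """Build an argv list for subprocess.run(); no shell parses `terms`."""
    # "--" ends option parsing, so terms starting with "-" are not read as flags.
    return ["grep", "--", terms, *sorted(glob.glob(pattern))]

def grep_shell_string(terms, pattern="*html"):
    """If a shell command string is required, quote the user input explicitly."""
    # The glob pattern is left unquoted on purpose so the shell expands it.
    return f"grep -- {shlex.quote(terms)} {pattern}"

print(grep_shell_string("x; rm -rf ~"))  # grep -- 'x; rm -rf ~' *html
```

The list form would be run with something like
`subprocess.run(grep_command(terms), capture_output=True, text=True)`. It is
worth checking that the file list is non-empty first, since grep with no file
arguments reads standard input.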

~~~
mcguire
escapeshellarg misses some cases. You need to use escapeshellarg_fixed.

~~~
jrockway
real_escapeshellarg_fixed, you mean.

------
deckiedan
Nice idea :-)

I once wrote a Python library for wrapping static content in a SQLite
database for full-text search, for roughly the same use case (static
content, but I would like a search system). I just put it online, in case
anyone is interested.

[https://github.com/danthedeckie/pagestore](https://github.com/danthedeckie/pagestore)
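
The general approach can be sketched with the standard `sqlite3` module. This
is not pagestore's API, just an illustration that assumes the local SQLite
build includes FTS5 or FTS4; the table name `pages` and the helper names are
made up.

```python
import sqlite3

def open_index():
    """Create an in-memory full-text table, preferring FTS5 over FTS4."""
    conn = sqlite3.connect(":memory:")
    for module in ("fts5", "fts4"):
        try:
            conn.execute(f"CREATE VIRTUAL TABLE pages USING {module}(path, body)")
            return conn
        except sqlite3.OperationalError:
            continue  # this SQLite build lacks the module; try the next one
    raise RuntimeError("SQLite was built without FTS4 or FTS5")

def add_page(conn, path, body):
    conn.execute("INSERT INTO pages (path, body) VALUES (?, ?)", (path, body))

def search(conn, query):
    """Return the paths of pages matching the query, sorted by path."""
    rows = conn.execute(
        "SELECT path FROM pages WHERE pages MATCH ? ORDER BY path", (query,)
    )
    return [path for (path,) in rows]

conn = open_index()
add_page(conn, "a.html", "Bloom filters make compact search indexes")
add_page(conn, "b.html", "Suffix trees power the Sphinx search box")
add_page(conn, "c.html", "A static blog can ship its own search index")
print(search(conn, "bloom"))  # ['a.html']
```

Unlike the Bloom filter approach, this needs something on the server (or a
local copy of the database) to run the query.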

------
posativ
You are not doing full-text search with Bloom filters (unless you add all
partial words to the filter), only whole-word search, which is imho the
biggest disadvantage (e.g. typos in the index or on the client's side).
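
To illustrate the partial-word point only (typos would need a different
technique), here is a small sketch with an assumed filter size and
hypothetical helper names. A filter built from whole words answers only
whole-word queries; adding every prefix makes partial matches possible but
sets many more bits.

```python
import hashlib

M = 1024  # filter size in bits (hypothetical)
K = 3     # bit positions per item

def positions(item):
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % M for i in range(K)]

def add(bits, item):
    for p in positions(item):
        bits |= 1 << p
    return bits

def might_contain(bits, item):
    """True means 'possibly present'; False means 'definitely absent'."""
    return all((bits >> p) & 1 for p in positions(item))

def index_words(text):
    bits = 0
    for word in text.lower().split():
        bits = add(bits, word)
    return bits

def index_prefixes(text, min_len=3):
    """Also add every prefix of each word, at the cost of a fuller filter."""
    bits = 0
    for word in text.lower().split():
        for end in range(min(min_len, len(word)), len(word) + 1):
            bits = add(bits, word[:end])
    return bits

text = "bloom filter search"
words = index_words(text)
prefixes = index_prefixes(text)
print(might_contain(words, "filter"))    # True
print(might_contain(prefixes, "filt"))   # True
# might_contain(words, "filt") is almost certainly False, barring a false positive.
print(bin(words).count("1"), bin(prefixes).count("1"))  # bits set in each filter
```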

------
vdm
This reminds me of how Datomic clients download read-only segments and process
queries locally.
[http://docs.datomic.com/architecture.html](http://docs.datomic.com/architecture.html)

