
Add search to your Jekyll blog - jameszhang
http://jzhang.io/add-search-to-jekyll
======
swanson
Here's another approach: make a templated JSON file that mimics a database
backend. When you generate the site (or just push to a GitHub Pages repo), the
JSON file will be populated with the "search index" of your site. Something
like this:
[https://github.com/swanson/lagom/blob/master/site.json](https://github.com/swanson/lagom/blob/master/site.json)

Then you can use whatever JavaScript you'd like to search through your pages
by pulling the JSON file down into memory and slicing and dicing to your
heart's content.
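
A minimal sketch of the "pull the JSON into memory and filter it" step, written in Python for brevity rather than browser JavaScript. The field names (`title`, `url`, `content`) and the sample entries are assumptions for illustration, not the layout of the linked site.json.

```python
import json

# Hypothetical stand-in for the generated site.json; the field names
# are assumptions, not the structure of the linked file.
SITE_JSON = """
[
  {"title": "Add search to Jekyll", "url": "/add-search",
   "content": "Filter posts with a little JavaScript."},
  {"title": "Hello world", "url": "/hello",
   "content": "First post on the new blog."}
]
"""

def search(pages, query):
    """Return pages whose title or content contains every query term."""
    terms = query.lower().split()
    hits = []
    for page in pages:
        haystack = (page["title"] + " " + page["content"]).lower()
        if all(term in haystack for term in terms):
            hits.append(page)
    return hits

pages = json.loads(SITE_JSON)
results = [page["url"] for page in search(pages, "jekyll search")]
```

This is plain substring matching, so it does no stemming or ranking; it only shows that the whole "backend" can be a file loaded once.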

~~~
derefr
This is really an approach that should get more attention.

People know all about database-backed websites (with an AJAX/REST server), and
static-generated sites (with no server at all.)

But there's a middle-ground where you statically generate precomputed
AJAX/REST _responses_ to any query you could make, shove them in an S3 bucket
and maybe put a CDN (e.g. Cloudflare) in front of it, and then treat _that_ as
your "server", building a traditional AJAXy frontend to talk to it. Scales
nigh-on infinitely.

The best part is, it's not necessarily read-only! You can write a "real"
server, too, that just handles updates. When it receives a request, it GETs
the old version of the object it's modifying from S3 (no need for a
database!), patches it up with the AJAXed-in data, and PUTs it back. (Follow
it up with a CDN single-file-purge API call, if it's relevant.)
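
A rough sketch of that GET, patch, PUT cycle. `FakeBucket` and `purge_cdn` are hypothetical in-memory stand-ins, not a real S3 or CDN client, and the object key and fields are made up for illustration.

```python
import json

class FakeBucket:
    """In-memory stand-in for an object store such as S3 (hypothetical)."""
    def __init__(self):
        self._objects = {}

    def get(self, key):
        return self._objects[key]

    def put(self, key, body):
        self._objects[key] = body

def purge_cdn(key, purged):
    # Placeholder for a CDN single-file purge call; it only records the key.
    purged.append(key)

def handle_update(bucket, key, patch, purged):
    """GET the stored JSON, merge the patch into it, PUT it back, purge."""
    current = json.loads(bucket.get(key))
    current.update(patch)
    bucket.put(key, json.dumps(current, sort_keys=True))
    purge_cdn(key, purged)
    return current

bucket = FakeBucket()
bucket.put("posts/1.json", json.dumps({"title": "Hello", "likes": 0}))
purged = []
updated = handle_update(bucket, "posts/1.json", {"likes": 1}, purged)
```

Note that a read-modify-write like this has no locking, so two updates to the same object arriving at once could overwrite each other; the sketch ignores that.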

------
StavrosK
This got me thinking: The obvious way to do full-text search in the post body
would be to build an index of the form {"word": [post_id, post_id, post_id]},
mapping each word to the posts it appears in. However, this could be huge.
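
For reference, a small sketch of that word-to-post-IDs index. The tokenizer and the sample posts are assumptions made up for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(posts):
    """Map each word to the sorted list of post IDs it appears in."""
    index = defaultdict(set)
    for post_id, body in posts.items():
        for word in tokenize(body):
            index[word].add(post_id)
    return {word: sorted(ids) for word, ids in index.items()}

def lookup(index, query):
    """Return IDs of posts that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)

# Hypothetical sample posts.
posts = {
    1: "Static sites are fast",
    2: "Search for static sites",
    3: "Bloom filters save space",
}
index = build_index(posts)
```

The index has one key per distinct word across the whole site, which is where the size concern comes from.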

Does anyone know if there's a technique that uses a list of posts, each with a
Bloom filter that contains all the words in that post? I.e. you iterate over
all the posts and check each post's Bloom filter for membership of all the
search terms.

Since the number of words is (probably) much greater than the number of posts,
you just need to loop a few hundred times at most, and you gain _a lot_ in
saved space. Also, since a Bloom filter has no false negatives, you are, at
least, guaranteed to find all the posts that mention the specified words (with
maybe a few "junk" ones in between, but which should be easy for the reader to
filter out).

You can't do weighting with this technique, but it should at least be a quick
way to figure out which post IDs you want to show.
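
A minimal sketch of the one-filter-per-post idea. This is not the linked proof of concept; the filter size (1024 bits), the three SHA-256-derived hash positions, the tokenizer, and the sample posts are all assumptions chosen for illustration.

```python
import hashlib
import re

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_filters(posts):
    """Build one Bloom filter per post, holding that post's words."""
    filters = {}
    for post_id, body in posts.items():
        bf = BloomFilter()
        for word in tokenize(body):
            bf.add(word)
        filters[post_id] = bf
    return filters

def search(filters, query):
    """Return post IDs whose filter (probably) contains every query word."""
    words = tokenize(query)
    if not words:
        return []
    return [pid for pid, bf in filters.items() if all(w in bf for w in words)]

# Hypothetical sample posts.
posts = {
    1: "Static sites are fast",
    2: "Search for static sites",
    3: "Bloom filters save space",
}
filters = build_filters(posts)
matches = search(filters, "static search")
```

Post 2 is guaranteed to be in `matches` because there are no false negatives; other posts can occasionally show up as false positives.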

Does anything like this exist currently?

EDIT: Here's a quick proof of concept:
[http://nbviewer.ipython.org/gist/skorokithakis/d115ab734d9ad...](http://nbviewer.ipython.org/gist/skorokithakis/d115ab734d9adbcf306f)

It works fine, but the filters are a bit large (2 KB each), so I'm not sure
how much space you save.

EDIT 2: This was so much fun that I wrote it up:
[http://www.stavros.io/posts/bloom-filter-search-engine/](http://www.stavros.io/posts/bloom-filter-search-engine/)

~~~
hedgehog
Very nice. You will save some space and allow for some typos if you stem and
soundex before insertion. You can also save space and improve the run time
somewhat if, rather than many separate Bloom filters, you build one large one
where each item is post ID + word. If you do that, you can also insert each
word bare so you get O(1) empty result sets, which is helpful if you're
updating the results with every keystroke in a search box.
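
A sketch of the single-large-filter variant, leaving out stemming and soundex. The `"<post_id>:<word>"` key format, the filter size, the hash scheme, and the sample posts are assumptions for illustration.

```python
import hashlib
import re

class BloomFilter:
    def __init__(self, size_bits=8192, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_filter(posts):
    """Insert every word twice: bare, and keyed by the post it came from."""
    bf = BloomFilter()
    for post_id, body in posts.items():
        for word in tokenize(body):
            bf.add(word)                 # bare: "some post has this word"
            bf.add(f"{post_id}:{word}")  # keyed: "this post has this word"
    return bf

def search(bf, post_ids, query):
    words = tokenize(query)
    # Early exit: if any bare word is absent, no post can match.
    if not words or any(w not in bf for w in words):
        return []
    return [pid for pid in post_ids if all(f"{pid}:{w}" in bf for w in words)]

# Hypothetical sample posts.
posts = {
    1: "Static sites are fast",
    2: "Search for static sites",
    3: "Bloom filters save space",
}
bf = build_filter(posts)
matches = search(bf, list(posts), "static search")
```

The bare-word check is what gives the quick empty result: a query word that was never inserted is usually rejected before any per-post lookups happen.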

~~~
StavrosK
Huh, very nice idea! That should, indeed, save a ton of space and be much
simpler when searching! I'll try that now, thank you.

EDIT: Hmm, turns out it's pretty much the same size, which makes sense, I
guess:
[http://nbviewer.ipython.org/gist/skorokithakis/0abbfebced25f...](http://nbviewer.ipython.org/gist/skorokithakis/0abbfebced25fd4b2ed3)

~~~
hedgehog
The space savings for the same error rate should be small (I think the
likelihood of false positives for a given load goes down slightly with the
size of the filter), but the benefit in lookup time should be significant for
multiword searches. Thinking about it more, though, if you're doing live
search you'll already have computed the results for the first word by the time
you are given a second, so maybe it doesn't matter.

~~~
StavrosK
I think it'll be faster to do multiple filters, because the one-filter way
requires hashing and comparing N times while the multiple-filter way requires
hashing once and comparing N times.
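
A small sketch of that difference, assuming all per-post filters share the same size and hash functions so a word's bit positions can be reused across them. The counter, sizes, and sample data are made up for illustration; one "round" here means computing all bit positions for one key.

```python
import hashlib

SIZE_BITS = 1024
NUM_HASHES = 3
counter = {"rounds": 0}  # how many times bit positions were computed

def positions(item):
    counter["rounds"] += 1
    out = []
    for seed in range(NUM_HASHES):
        digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
        out.append(int.from_bytes(digest[:8], "big") % SIZE_BITS)
    return out

def make_bits(items):
    """Build a Bloom filter (as an int bit array) from the given items."""
    bits = 0
    for item in items:
        for pos in positions(item):
            bits |= 1 << pos
    return bits

def has(bits, pos_list):
    return all((bits >> pos) & 1 for pos in pos_list)

# Hypothetical sample posts, already tokenized.
posts = {1: ["static", "sites"], 2: ["static", "search"], 3: ["bloom", "filters"]}

per_post = {pid: make_bits(words) for pid, words in posts.items()}
combined = make_bits(f"{pid}:{w}" for pid, words in posts.items() for w in words)

# Many filters: hash the word once, then compare against each filter.
counter["rounds"] = 0
word_positions = positions("static")
many_hits = [pid for pid, bits in per_post.items() if has(bits, word_positions)]
many_rounds = counter["rounds"]

# One filter: the key includes the post ID, so hash once per post.
counter["rounds"] = 0
one_hits = [pid for pid in posts if has(combined, positions(f"{pid}:static"))]
one_rounds = counter["rounds"]
```

With three posts, `many_rounds` is 1 and `one_rounds` is 3, which is the N-versus-1 hashing cost described above.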

~~~
hedgehog
Oops, you are correct.

------
suter
We implemented search on our help site
([http://help.simpletax.ca](http://help.simpletax.ca)) using the excellent
lunr.js ([http://lunrjs.com](http://lunrjs.com)) + this Jekyll plug-in
([https://github.com/slashdotdash/jekyll-lunr-js-search](https://github.com/slashdotdash/jekyll-lunr-js-search)). The search
index is updated when you build your Jekyll site, so it's a piece of cake to
maintain and gives you full text search. I don't know how well it would scale,
but if you are in the hundreds of posts, you should be ok.

~~~
jameszhang
That's awesome, especially for a help site. My company does the same thing
with our documentation.

------
paukiatwee
My Jekyll blog uses lunr.js for full-text search with a JSON file as the
backend. Check it out here:
[http://dreamand.me/web/fulltext-search-at-jekyll-site/](http://dreamand.me/web/fulltext-search-at-jekyll-site/)
and lunr.js is here:
[https://github.com/olivernn/lunr.js](https://github.com/olivernn/lunr.js)

~~~
jameszhang
That's neat, but is something wrong? I get this error on your site:
[https://cloudup.com/cti8Egu8fQS](https://cloudup.com/cti8Egu8fQS)

------
shriphani
How about using a lightweight TF-IDF engine like whistlepig
([https://github.com/wmorgan/whistlepig](https://github.com/wmorgan/whistlepig))?
It is implemented in ANSI C.

I had no issues writing a Racket wrapper for it using the FFI:
[http://blog.shriphani.com/2013/08/27/racket-whistlepig-bindi...](http://blog.shriphani.com/2013/08/27/racket-whistlepig-bindings/)

------
captn3m0
Even though it's a nice hack, it's more of a CSS-based filter that can be
used irrespective of Jekyll. A Jekyll-specific search should include searching
articles. I've been wanting to build something similar using the excellent
lunr.js. The idea would be to rebuild the index before each push to the repo,
and have a special page (search.html) that loads the index via Ajax and shows
the results as you type.

Still, cool hack.

------
rgrieselhuber
Swiftype?

