
Full-Text Search in JavaScript - kiril-me
http://burakkanber.com/blog/machine-learning-full-text-search-in-javascript-relevance-scoring/
======
jka
There's a pretty neat little project called 'lunr.js' which can provide a
fairly fully-featured JavaScript search engine for use in the browser.

It supports multi-field search, stop-word removal, and tf-idf, but no Okapi
BM25, alas.

[http://lunrjs.com/](http://lunrjs.com/)

~~~
azdle
There's also [http://elasticlunr.com/](http://elasticlunr.com/). I'm not sure
what algorithm it uses, but the site says "Elasticlunr.js use quite the same
scoring mechanism as Elasticsearch, and also this scoring mechanism is used by
lucene.", so maybe? I can't say I know that much about this topic...

I actually just used that library to build the search for the docs site where
I work, and I have to say it works really well. It's a fully static site
(hosted on GitHub Pages), and all the search is done in the browser based on a
JSON "index" file that I generate alongside the rest of the site.
[http://docs.exosite.com/](http://docs.exosite.com/)
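
For anyone wondering what "generate an index at build time, query it in the browser" looks like in principle, here's a rough sketch in Python. To be clear, this is not elasticlunr's index format or API, just a toy inverted index with made-up page paths and function names, serialized to JSON and read back the way a client would after fetching it.

```python
import json
from collections import defaultdict

def build_inverted_index(pages):
    """pages maps url -> text; returns {term: sorted list of urls}."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    # Sets are not JSON-serializable, so store sorted lists instead.
    return {term: sorted(urls) for term, urls in index.items()}

def lookup(index, query):
    """Return the urls that contain every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return []
    results = set(index.get(terms[0], []))
    for term in terms[1:]:
        results &= set(index.get(term, []))
    return sorted(results)

# Hypothetical pages, purely for illustration.
pages = {
    "/install": "install the client library",
    "/api": "client api reference",
}
serialized = json.dumps(build_inverted_index(pages))  # build step output
index = json.loads(serialized)                        # what the client parses
```

A real library layers stemming, stop words and relevance scoring on top, but the static-file-plus-client-side-lookup shape is the same.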

~~~
olivernn
I'm fairly sure elasticlunr is a fork of lunr; I still seem to have the most
commits, even! [1]

I'll have to take a look and see what @weixsong added, perhaps there are some
changes that I can merge upstream.

[1]
[https://github.com/weixsong/elasticlunr.js/graphs/contributo...](https://github.com/weixsong/elasticlunr.js/graphs/contributors)

------
ninjakeyboard
I built a similar tf-idf search with cosine similarity in Scala, if anyone is
curious:
[https://github.com/jasongoodwin/tfidf-search](https://github.com/jasongoodwin/tfidf-search)
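
If you just want the gist without reading the repo, the general idea fits in a few lines. This is a minimal Python sketch rather than the code from that project: whitespace tokenization, raw term counts, a plain log idf, and function names I made up for the example.

```python
import math
from collections import Counter

def weigh(tokens, idf):
    """Turn a token list into a sparse {term: tf * idf} vector."""
    tf = Counter(tokens)
    return {t: count * idf[t] for t, count in tf.items() if t in idf}

def build_vectors(docs):
    """Return (idf table, one tf-idf vector per document)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    return idf, [weigh(tokens, idf) for tokens in tokenized]

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def search(query, idf, vectors):
    """Return document indices ranked by similarity, dropping zero scores."""
    q = weigh(query.lower().split(), idf)
    scored = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return [i for score, i in sorted(scored, reverse=True) if score > 0]

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply",
]
idf, vectors = build_vectors(docs)
```

Calling `search("dog", idf, vectors)` only matches the second document, since a query term that appears nowhere else contributes nothing to the other dot products.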

------
hellofunk
I think all the HN love is giving this web server a rough time.

~~~
JonnieCache
[https://archive.is/EU35A](https://archive.is/EU35A)

------
stephanheijl
It looks pretty cool, but it slowed my browser down to a crawl when I had it
open in the background. That seems like something that needs optimizing.

------
meeper16
An approach more along the lines of machine learning would be to use the
technique that word2vec is based on, which came out of Berkeley Lab:
[https://www.kaggle.com/c/word2vec-nlp-tutorial/forums/t/1234...](https://www.kaggle.com/c/word2vec-nlp-tutorial/forums/t/12349/word2vec-is-based-on-an-approach-from-lawrence-berkeley-national-lab)

------
andor
I'd call that Information Retrieval, not Machine Learning. Indexing documents
doesn't "learn" any more than a file system storing files, and TF-IDF and BM25
are simply weighting functions.
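
To make the "just weighting functions" point concrete, here is roughly what the two look like in Python. Treat it as a sketch under assumptions: the BM25 idf below is the non-negative variant (the kind Lucene uses), `k1=1.2` and `b=0.75` are common defaults rather than anything the article prescribes, and the function names are mine.

```python
import math

def tf_idf(tf, df, n_docs):
    """Classic weight: raw term frequency times log inverse document frequency."""
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(n_docs / df)

def bm25(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25 weight for one term in one document.

    k1 controls how quickly repeated occurrences saturate;
    b controls how strongly long documents are penalized.
    """
    if tf == 0 or df == 0:
        return 0.0
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)
```

Neither function has trainable parameters: both are fixed formulas over counts you already have in the index. The practical difference is that tf-idf grows linearly with term frequency, while BM25 levels off and also accounts for document length.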

But of course, it's an interesting topic. If you want to learn more, the first
edition of the most widely used textbook is free:

[http://www-nlp.stanford.edu/IR-book/](http://www-nlp.stanford.edu/IR-book/)

~~~
draker
Novice question: Is the "www-" just part of the given subdomain or is there
something else that occurs using that syntax?

"[http://nlp.stanford.edu/IR-book/](http://nlp.stanford.edu/IR-book/)"
resolves to the same resource.

~~~
quesera
No magic. "www-nlp" and "nlp" are explicitly configured to point to the same
IP address in the DNS.

Then (in the simplest case) the responding webserver is configured to treat
the two hostnames (and possibly others) as identical and serve the same files.

So I guess there is magic: CNAMEs, virtual hosts, and HTTP/1.1.
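
The server-side half can be sketched in a few lines of Python. This is an illustration of name-based virtual hosting in general, not Stanford's actual configuration: the table, the `/srv/nlp` path and the function name are all made up.

```python
# Hypothetical virtual-host table: two hostnames, one document root.
VHOSTS = {
    "www-nlp.stanford.edu": "/srv/nlp",
    "nlp.stanford.edu": "/srv/nlp",
}

def resolve_docroot(host_header):
    """Map the value of an HTTP/1.1 Host header to a document root.

    Hostnames are case-insensitive and the header may carry a port,
    so normalize both before the lookup. Returns None for unknown hosts.
    """
    host = host_header.strip().split(":", 1)[0].lower().rstrip(".")
    return VHOSTS.get(host)
```

Since both names land on the same IP address, the Host header is the only thing that tells the server which site was requested, and here both entries deliberately point at the same files.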

