

Searching with Riak Search - pharkmillups
http://blog.inagist.com/searching-with-riaksearch

======
bravura
Could someone clarify the RiakSearch scoring function for evaluating
retrieved results?

It appears that RiakSearch is modeled after Lucene in a variety of ways
(<https://wiki.basho.com/display/RIAK/Riak+Search>):

"At index time, Riak Search tokenizes a document into an inverted index using
standard Lucene Analyzers. (For improved performance, the team re-implemented
some of these in Erlang to reduce hops between Erlang and Java.)"

"Search queries use the same syntax as Lucene, and support most Lucene
operators including term searches, field searches, boolean operators,
grouping, lexicographical range queries, and wildcards (at the end of a word
only)."

However, there is a difference in the scoring function
(<https://wiki.basho.com/display/RIAK/Riak+Search+-+Querying>):

"Documents are scored using roughly the same formulas described here:

[http://lucene.apache.org/java/3_0_2/api/core/org/apache/luce...](http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/Similarity.html)

The key difference is in how Riak Search calculates the Inverse Document
Frequency. The equations described on the /Similarity/ page require knowledge
of the total number of documents in a collection. Riak Search does not
maintain this information for a collection, so instead uses the count of the
total number of documents associated with each term in the query."

I am confused by this statement that they don't know "the total number of
documents in a collection".

If they were to say: "We don't use the document frequency (# of documents
containing this term / total # documents) because we cannot compute (# of
documents containing this term) over the entire corpus. Instead we estimate
the document frequency using the term frequency (# of occurrences of this term
in the corpus / total # terms)." that might be sensible.

But I am not really clear what the current scoring function is.
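
To make the question concrete, here is a minimal Python sketch of one possible reading of the quoted passage. The `lucene_idf` formula is the one documented for Lucene 3.0's DefaultSimilarity. The `approx_idf` function is purely an assumption about what "the count of the total number of documents associated with each term in the query" means: it sums the per-term document counts and uses that sum in place of the collection size. The function names and the example counts are hypothetical, and this is not taken from the Riak Search source.

```python
import math


def lucene_idf(doc_freq, num_docs):
    """IDF as documented for Lucene 3.0's DefaultSimilarity:
    1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def approx_idf(term, query_doc_freqs):
    """ASSUMED reading of the Riak Search docs, not confirmed by the source:
    replace the collection size with the sum of the document counts of
    all terms in the query."""
    pseudo_total = sum(query_doc_freqs.values())
    return lucene_idf(query_doc_freqs[term], pseudo_total)


# Hypothetical document counts for a two-term query.
freqs = {"riak": 9, "search": 90}
for term in freqs:
    print(term, approx_idf(term, freqs))
```

Under this reading the rarer term still gets the larger weight within a multi-term query, but for a single-term query the substitute total equals the term's own document count, so the IDF falls just below 1 regardless of how rare the term is.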

------
boundlessdreamz
If search performance and accuracy are the criteria for choosing the data store,
how does riak+riak search compare with mysql+sphinx?

~~~
rb2k_
It is WAY easier to scale (up AND down) over several nodes. So if you've got
BIG amounts of text that you have to fulltext-search, Riak might be the better
option.

~~~
bravura
And, for good measure, could you compare RiakSearch to horizontally scaling
Lucene?

And ElasticSearch, if you are familiar with it?

~~~
samratjp
IMO, this is one of the better discussed comparisons on the whole Lucene
sharding business:

[http://mail-archives.apache.org/mod_mbox/hbase-user/201006.m...](http://mail-archives.apache.org/mod_mbox/hbase-user/201006.mbox/%3C149150.78881.qm@web50304.mail.re2.yahoo.com%3E)

EDIT: Also, keep an eye on Twitter's Lucene branch -
[http://engineering.twitter.com/2010/10/twitters-new-search-a...](http://engineering.twitter.com/2010/10/twitters-new-search-architecture.html)

