

Getting 100x better search performance on Riak - gwf
http://blog.clipboard.com/2012/03/18/0-Milking-Performance-From-Riak-Search

======
gfodor
The deficiency with AND queries due to term based partitioning reminds me of a
hack I did a few years ago trying to build a basic IR system on simpledb:

[http://codingthriller.blogspot.com/2008/04/simpledb-full-tex...](http://codingthriller.blogspot.com/2008/04/simpledb-full-text-search-or-how-to.html?m=1)

Of course, I was backed into this corner by the various constraints at the
time, and this was just a fun hack. It sounds a bit precarious to consider
using Riak for text search, at least compared to something like Solr, if it
truly has issues with the basic bread and butter of search like sorting by
recency and conjunctions. (of course I am only going on what is in this post,
I have no experience with Riak)

~~~
gwf
It really isn't all that bad. When you compare what we had to do to the
alternative of keeping two big data systems in sync, Riak is a breeze; I think
we got off easy.

------
pkulak
Mecha is being built to directly solve these exact problems:

<http://vimeo.com/30396564>

------
tpsreport
What's the baseline? When the baseline is terrible, a 100x improvement is not
a big deal.

No good engineer would cite "100X improvement" without quantifying the
baseline level. This is PR, not engineering and not science.

~~~
gwf
The baseline was a substantial test corpus that we scaled several orders of
magnitude over a series of runs, all meant to simulate typical clip sizes with
typical word frequencies. The 10-100x gains had two contributions, each in the
3-10x range. We tested against the standard search, which performed pagination
incorrectly because it sorted and sliced in the wrong order. We also tested it
on two types of map/reduce jobs that correctly implemented the sort and slice
(and had been in production).

Ideally, we would have kept the data around to give a fuller report. But the
truth is we did this over 9 months ago and didn't save the data. After
informally sharing the impact with a lot of people, we heard a lot of
encouragement to share the techniques.

And you're right, this is not hard core science nor engineering. But it is a
good tip, which you can take or leave.

~~~
tpsreport
I'm still not seeing you report something that says "an operation that took X
ms now takes Y ms." Note the absolute numbers whose units are in seconds.
That's what I'd like to see for a baseline.

And I won't even comment on the "we did this 9 months ago and did not save the
data" part. That kind of stuff would not fly in medicine or science or most
traditional engineering fields. Why should you be exempt?

~~~
gwf
To be clear, operations in our test corpus that now take X ms previously took
100X ms on average. Is that the statement you are looking for?

One thing that I think you're missing is that while we experienced a 100x gain
for our application, our findings aren't strictly empirical in the sense that
the gain is always 100x. In many applications it will be more. The insight in
the blog post is analytical, not empirical. Specifically, most people (I
think) would assume a text index to be document partitioned. In Riak it's not.
On a fundamental level this means that AND operations take time
O(max(|A|,|B|)) instead of O(min(|A|,|B|)), where |A| and |B| are the sizes of
the result sets that match queries A and B, respectively. If you pick words at
random from a power law distribution (which is typical of all natural
languages), you will more often than not see many orders of magnitude
difference between |A| and |B|. If you sample queries from real query streams,
you will see the same. For some intuition, just think about a query that pairs
a common tag (like "funny" in the example) with something restricted (like a
particular user). The first will typically be a constant fraction of the size
of your corpus, while the second will have a size that is closer to a
constant.

Putting this all together, the savings from moving the intersection where we
did will only grow on a relative basis as we get more users and clips. This
hack costs 2x the storage but always gives O(min(|A|,|B|)) complexity instead
of O(max(|A|,|B|)). The 100x win is just shorthand for saying that those two
set sizes typically varied by two orders of magnitude because of power law
distributions. But in academia, we refer to that as "really fucking big".

The "presort" option that we added will scale differently, but it has an
equally dramatic impact because pagination is effectively broken in Riak
search if you do not use relevance sort. In this case, the O(.) argument is
murkier because one version is done entirely within the index (which is
usually in RAM) and the other has to hit the disk. In formal engineering
circles, we refer to this type of boost as "unfucking real".
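To make the pagination point concrete, here is a toy example (data invented
for illustration) of why slicing a hit list before sorting it by recency
returns the wrong page, which is the sort-versus-slice misordering mentioned
above:

```python
# Each hit is (doc_id, timestamp); toy data for illustration only.
hits = [("clip1", 100), ("clip2", 300), ("clip3", 200), ("clip4", 900)]

# Broken: take the first page of raw hits, THEN sort that page by recency.
broken_page = sorted(hits[:2], key=lambda h: h[1], reverse=True)

# Correct (what a presort inside the index buys you): sort the full result
# set by recency first, THEN slice out the page.
correct_page = sorted(hits, key=lambda h: h[1], reverse=True)[:2]

print(broken_page)   # [('clip2', 300), ('clip1', 100)] -- clip4 is missing
print(correct_page)  # [('clip4', 900), ('clip2', 300)]
```

The broken version silently drops the newest document because it never made
it into the slice that was sorted.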

So, we could have measured things with more precision, but this was such an
obvious win that it was almost pointless to calibrate it further.

~~~
tpsreport
You have a knack for producing a lot of text with no supporting evidence.
Especially of a quantitative nature.

This might go over well with a certain class of people, but it offends my
engineering sensibilities.

~~~
Shalen
Hi TPSReport - You realize that "gwf" is the Gary William Flake who has
written an award winning complex book that is used in colleges worldwide as
well as has run R&D for many companies like Microsoft, Yahoo, and Overture
right? I'm sorry to say this but personally I'm finding your comments on this
post as well other posts negative and lacking value. It would be great and
helpful if you have any constructive/positive ideas, tips or experience to
share to add value to the community. I'm sharing this in a positive
constructive spirit. Thanks.

~~~
tpsreport
Let me reiterate my constructive suggestion that the OP provide actual data. I
can see how the author wrote a pop book, as he is quite verbose, but any
engineer can tell you that he is also quite evasive on technical questions.
"What's the baseline?" should not be a tough question for anyone writing
about performance improvements.

------
joshu
Hi Gary!

~~~
gwf
Hey Josh!

~~~
joshu
Man, it's downvote central over here.

~~~
gwf
The hive mind is cute when it's temperamental.

~~~
jpeterson
Well, you guys could use email or IM for this trite stuff and reserve HN for
meaningful discussion.

