

Greplin opensources Lucene Utils and Bloom Filters - smanek
http://tech.blog.greplin.com/lucene-utilities-and-bloom-filters

======
dabeeeenster
_A query that matches phrase prefixes - for example "Epic w" will match both
"Epic win" and "Epic wonder". This is particularly useful for implementing
Google Instant style searches_

This is _really_ useful and painfully lacking from Lucene. Great stuff.

~~~
sigil
I was curious how they implemented prefix matching, so I went and looked at
the code [1]. Unfortunately, this is just a simple linear scan that calls
.startswith(). It is possible to do fast (log N) prefix matching with radix /
critbit trees.

[1] <https://github.com/Greplin/greplin-lucene-utils/blob/master/src/main/java/com/greplin/lucene/query/PhrasePrefixQuery.java>

~~~
nostrademons
Radix trees are O(k), where k is the length of the query string.

Anyway, for prefix matching, you want to take careful note of the size and
shape of your data, because it matters for which algorithm is fastest. If your
data all fits in RAM (or better yet, all fits in L2 cache), then I've had very
good results with binary search (O(log N)) to find the first matching result,
and then linear scan to find all possible suffixes. This is a lot more cache-
friendly than radix trees, which have better theoretical performance but often
touch memory that's all over the place.
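
A minimal Java sketch of the binary-search-then-scan approach described above, assuming the terms fit in memory as a sorted array. The class and method names (`SortedPrefixSearch`, `lowerBound`, `withPrefix`) are made up for illustration and are not taken from Greplin's code or from Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SortedPrefixSearch {
    /** Index of the first element >= key in a sorted array: O(log N). */
    static int lowerBound(String[] sorted, String key) {
        int lo = 0, hi = sorted.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted[mid].compareTo(key) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    /** All entries starting with prefix; they form one contiguous run in sorted order. */
    static List<String> withPrefix(String[] sorted, String prefix) {
        List<String> out = new ArrayList<>();
        // Binary search finds the start of the run, then a linear scan walks it.
        for (int i = lowerBound(sorted, prefix);
             i < sorted.length && sorted[i].startsWith(prefix); i++) {
            out.add(sorted[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] terms = {"epic wonder", "epic", "epic win", "epoch", "eel"};
        Arrays.sort(terms); // must be sorted with the same ordering compareTo uses
        System.out.println(withPrefix(terms, "epic w")); // prints [epic win, epic wonder]
    }
}
```

The scan touches only adjacent array slots, which is where the cache-friendliness comes from; the cost is O(log N) for the search plus the number of matches for the scan.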

~~~
sigil
> Radix trees are O(k)

Yes, thank you.

------
yid
I really don't mean to sound unpleasant, but does anyone else think that the
primary goal of Greplin is to exit via Google or Facebook? From a consumer
point of view, can it ever be anything more than a niche product? I'm
genuinely curious...these patches are somewhat superficial, and while a nice
gesture, the cynical part of me sees it as a ploy to gain cred with the people
most likely to influence buyouts.

~~~
zem
you say that like it's a bad thing. why is "1. build something that would be
useful to a large company, 2. prove by acquiring users that the product has
real value, 3. get acquired" a bad business model? from a consumer point of
view, they'll benefit when google or facebook or whoever integrates greplin's
techniques.

~~~
blhack
This is completely irrelevant to the discussion, please feel free to downvote
me for it:

2 things: Your account is exactly "1337" days old today, happy leet day :).
Also, are you the same zem that posted on newslily a while ago? If so, you
posted some really awesome stuff, thanks :)

(Sorry if I've recognized you and asked this question before, I have a
terrible memory)

~~~
zem
thanks, would never have noticed l337 day on my own :) and yeah, same zem!
dropped off newslily due to having too many social networks to keep up with;
nice to see it still going strong.

~~~
blhack
o/ well long-distance high five to you :).

I wouldn't really say it's still going _strong_, haha. Unfortunately, we
never really got the traction we needed for it to keep running by itself
(Cody and I were submitting a _lot_ of the content, which was fine, but it
would have been really awesome if we didn't have to).

<sarcasm>Big surprise there, though, we were trying to compete against HN and
reddit</sarcasm>

Back in November, we both started working on a new project:
<http://thingist.com>, which has been taking a lot of my time lately, so I
haven't been submitting as much.

Anyway, good to see you around here, man :)

~~~
zem
where is the "help->about" page for thingist? i can't make out whether it's
competing with tumblr, twitter, or something new entirely.

------
surtyaar
There are some good open source implementations in Python (and, I am sure, in
other languages).

<https://github.com/jaybaird/python-bloomfilter> - offers scalable bloom
filters

<https://github.com/axiak/pybloomfiltermmap> - uses mmap

~~~
smanek
We originally used mmap too, but it didn't work very well. First, Java has
some rather serious mmap limitations
(<http://bugs.sun.com/view_bug.do?bug_id=4724038>). Second, for some reason we
saw occasional data corruption (which we haven't seen since we moved to the
current system).

~~~
jwr
Have those mmap limitations really impacted you? And the corruption — I'm
assuming that was with read/write mappings that you wrote to?

I'm asking, because we're getting great mileage out of mmap in Clojure, albeit
for read-only mappings. And that's in a search engine :-) Using mmap for large
data is great, because you avoid enlarging your heap and the garbage collector
doesn't even have to care about your data.
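
For readers who haven't used it, a read-only mapping on the JVM can be sketched roughly like this with `java.nio`. This is only an illustration of the general technique, not the code either project uses; the class name `ReadOnlyMapping` and its helper are hypothetical.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

class ReadOnlyMapping {
    /**
     * Maps a whole file read-only. The bytes live outside the Java heap,
     * so the garbage collector never scans or copies them.
     * Note: a single mapping is limited to Integer.MAX_VALUE bytes.
     */
    static MappedByteBuffer mapReadOnly(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    public static void main(String[] args) throws IOException {
        MappedByteBuffer buf = mapReadOnly(Paths.get(args[0]));
        System.out.println("mapped " + buf.capacity() + " bytes");
    }
}
```

Files larger than the per-mapping limit would need to be split across several mappings.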

------
chaostheory
my question is why they didn't use solr instead?

~~~
smanek
The first version of Greplin was actually built on Solr!

Eventually, we needed more flexibility than Solr easily offered though. For
example, we've added far more efficient sharding, document modifications
(updates and deletions), flushing, and near real time search than either
Lucene or Solr support out of the box (and they were much easier to add to
Lucene than Solr, since Lucene makes fewer assumptions about your
dataset/use).

I think Solr is a great tool if your needs happen to fit into their model -
but if they diverge a lot, it sometimes makes more sense to build your own
custom framework on top of Lucene.

~~~
joshhart
Did you consider the distributed, real-time, and faceted extensions to Lucene
we (LinkedIn) built? Sensei, Zoie, and Bobo respectively. Get em at
<http://sna-projects.com/sna/>

~~~
smanek
We're huge fans! Our real time search technology copied its architecture from
Zoie, and our faceting/caching took some ideas from Bobo!

We didn't use them outright since we have fairly different
requirements/constraints (our data has some pleasant properties that make it
easier to shard and facet than the general case) and we wanted something a bit
simpler.

Shoot me an email sometime though (email in profile)! I'd love to buy you
lunch and pick your brain ;-) You guys clearly know what you're doing!

------
yuvadam
I know the verb "open-sources" is not well defined, but...

Publishing 2 github repositories, each with several classes which are mostly
trivial, is not "opensourcing".

Much like the fact that a weekend project is not a "startup".

~~~
fleetingthought
From Wikipedia: "The term open source describes practices in production and
development that promote access to the end product's source materials."

I'd imagine this falls into that category. The takeaway for me is that the
code _is_ useful, regardless of the size, and they took the time to let other
folks enjoy it.

~~~
yuvadam
Don't get me wrong, I think highly of Greplin, and appreciate every line of
open code. The process itself is to be highly praised.

But I've seen much larger open source patches - that are no less important -
that have received much less publicity than this.

~~~
budu3
They might be getting a disproportionate amount of publicity compared to much
larger, more complex projects, in your opinion, but that doesn't negate the
fact that it is still open source.

