

Open Source Search with Lucene & Solr - igrigorik
http://www.igvita.com/2010/10/22/open-source-search-with-lucene-solr/

======
fizx
For anyone who would like to take Solr for a spin, I invite you to check out
nzadrozny's and my startup: <http://websolr.com/>

We are a bootstrapped startup providing managed Solr hosting in the cloud
(currently EC2). We're all about making the operational side of high
performance Solr hosting as one-click easy as possible, so developers can
focus their time on doing cool stuff with it.

We love HN and are frequent commenters/lurkers around here, so we made a
"HN10" coupon which you can use on signup to get a month of our Silver plan
for free.

~~~
thorax
I really like the idea of this service. The difficulty is, I'm not seeing any
"Getting Started with Websolr" guide to understand how hard it is to get up
and running with you. Where would that be?

In my ideal world you would have a demo instance or two where we could
connect and query arbitrary test data to understand performance/behavior/etc.
before signing up to host real data there.

~~~
nzadrozny
Yeah, great points. Thanks for your feedback! Better general documentation is
pretty high on our list right now.

To answer your immediate question: we started as a Heroku add-on, so you might
take a glance at our documentation there (<http://docs.heroku.com/websolr>).
It's targeted at Rails applications using Sunspot, so ymmv. We're working on
creating and compiling similar guides for other platforms as well.

Seems like it's high time for us to do a "review my startup" post… ;)

------
evilhackerdude
Riak Search has been released recently. It’s got Lucene and part of the Solr
HTTP API built-in.

Basically you push json/xml/whatever documents into buckets. The docs are then
indexed, either by field name (json & xml) or simply as fulltext. It is pretty
cool because it’s based on Riak Core and thus has the same benefits as Riak
K/V. Lucene runs transparently in the background - afaik you never even have
to touch it.

Read more in their wiki: <https://wiki.basho.com/display/RIAK/Riak+Search>

Especially:
[https://wiki.basho.com/display/RIAK/Riak+Search+-+Indexing+a...](https://wiki.basho.com/display/RIAK/Riak+Search+-+Indexing+and+Querying+Riak+KV+Data)

------
ankimal
We use an Enterprise Search Platform (our biggest software acquisition) minus
the support (another dumb idea). The entire thing is a black box. It takes
days to figure out what "Error: FS error" actually means. For a new project,
we used Solr to maintain a smaller index and have never looked back. For
anybody about to start building a search index, Lucene/Solr is the way to go.

~~~
storm
I've been using Solr for some pretty heavy lifting, and it's incredibly
impressive. Rock solid, extremely advanced analysis and search capabilities,
and the performance is amazing if it's on suitable gear. Time invested in
learning it pays off big.

I'm familiar with the enterprise black boxes you're talking about - I probably
know the specific one you're tormented by. I've seen the licensing fees alone
lead large companies to drop rows from their front-end stores to avoid going
into a new pricing tier (takes balls of steel to charge by the record, I must
say), and I've seen competitors fold at least in part due to the expense of
paying for the thing.

A lot of startup folks getting excited about NoSQL seem to have passed over
Lucene/Solr completely, and I think it's worthy of much more consideration
than it gets. It's mature, it's _fast_, and the people working on it live and
breathe the problem space.

There are undoubtedly devs out there badly needing powerful analysis and
search to execute on their vision, but who will end up suffering with half-
baked solutions for lack of even _hearing_ about Solr, much less giving it a
try.

~~~
ankimal
I feel another issue is that management sometimes believes that paying big
bucks means your rear end is covered. It takes a lot to convince them that
this is free and works great at the same time. What's more, the community is
great!

------
dangrover
Haystack for Django is a really nice way to integrate with these systems. You
can use Lucene, Solr, or Whoosh as backends for your search.

~~~
nzadrozny
Sunspot for Ruby is another good Solr client that's popular with Rails
applications.

<http://github.com/outoftime/sunspot/>

While Solr's API is pretty easy to work with directly, there's definitely
something to be said for using a quality client for your platform.
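To give a feel for how simple the direct route is, here's a rough Python
sketch of querying Solr's select handler over HTTP. The host, port, and field
names are made up; adjust them for your own instance.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def solr_select_url(base, query, rows=10, fields="*,score"):
    """Build a URL for Solr's /select handler, asking for a JSON response."""
    params = urlencode({"q": query, "rows": rows, "fl": fields, "wt": "json"})
    return "%s/select?%s" % (base.rstrip("/"), params)

def solr_search(base, query, **kw):
    """Run the query against a live Solr instance and return the matching docs."""
    with urlopen(solr_select_url(base, query, **kw)) as resp:
        return json.load(resp)["response"]["docs"]

# Hypothetical local instance; no client library required.
url = solr_select_url("http://localhost:8983/solr", "title:lucene")
```

That's really all a basic integration needs, which is why thin clients like
Sunspot can stay close to the raw API.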

------
akozak
At Creative Commons we use Lucene/Nutch for our educational search prototype
DiscoverEd: <http://wiki.creativecommons.org/DiscoverEd>

It was easy enough to add in our special sauce like a triple-store for
consuming and displaying semantic data (I guess I can say easy since I didn't
do it myself).

~~~
sdesol
I would say it's pretty easy if you are technically inclined. When I
implemented the first iteration of my text search engine using Lucene, I
didn't even know Java, but I was able to write my own custom tokenizer and get
it to index and retrieve results in about 6 hours.

I highly recommend you get the book "Lucene in Action", as it gives solid
examples that you can build upon.

------
nkurz
I'm a fan of and contributor to Lucy, which is mentioned briefly in the header:
<http://incubator.apache.org/lucy/>

While Lucy did start out as a C port of Lucene (hence the name), it has since
broken from any attempt at Lucene compatibility. Instead, it's aiming to be a
fast and flexible standalone C core with bindings to higher-level languages.
Since it's growing out of KinoSearch, its best-developed bindings are in Perl,
but support for all the usual suspects (Python, Ruby, etc.) is planned.

Technically, the main difference from Lucene is that it gets cozier with the
machine: the OS is our VM. It's mostly mmap() IO, and we're very conscious of
paging and cache issues. While we're trying to maintain 32-bit backward
compatibility, we take full advantage of 64-bit solutions when they offer
themselves. The scripted bindings are also very cool --- you can do things
like make callbacks to scoring methods in your script language to truly
customize your results.

If for some reason you're not finding what you need in Lucene and Solr, check
it out. We just became a full Apache incubator project, and are eager to get
more developers involved. You'll find clean C code, decent documentation, and
a low traffic but very responsive list. If you're using Perl, C or C++, you'll
get a great product from the start. If you're using anything else, you'll have
to help a lot on the bindings, but I think you'll be quite pleased with the
end result.

------
spoondan
Lucene is great but I wish schemas were an optional part of Solr. They add
complexity and take away flexibility. If you have a photo database where you
want searchable metadata describing the subject of the photographs, you can do
this easily and naturally in Lucene. But Solr requires that you either (1)
predefine the available metadata or (2) expose field typing details to your
users (so a field for birthday is actually "birthday_d", with the "_d"
indicating it's a date). Both of these are very unattractive to me.

The worst part is that I have no idea what benefits schemas are supposed to
bring me. The documentation vaguely promises that schemas "can drive more
intelligent processing", but I have a feeling I could get that more easily
without schemas. It also tells me that "explicit types eliminate the need for
guessing of types," but only, apparently, by requiring users to _understand
and remember_ them.

~~~
storm
Schemas are an optional part of Solr. Pretty sure the default schema.xml has
an example of a catch-all field definition; if you use that, it will
automatically deal with any key you want to throw at it.

Of course you need to specify one field type (analysis stack) to apply to all,
but I don't know how you expect to avoid that - gonna have to express that
metadata _somewhere_ if you need more complex behavior.

Personally I think the _d, _i approach is OK: suffixes aside, you get complex
field analysis options without writing a schema.
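For reference, the dynamic-field declarations in schema.xml look roughly like
this (the suffixes and type names here are illustrative, taken from the stock
example schema):

```xml
<!-- Illustrative schema.xml fragment: any field ending in _d is treated as
     a date, _i as an integer, and a catch-all rule indexes everything else
     as plain text. -->
<dynamicField name="*_d" type="date" indexed="true" stored="true"/>
<dynamicField name="*_i" type="int"  indexed="true" stored="true"/>
<dynamicField name="*"   type="text" indexed="true" stored="true"/>
```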

------
cowmixtoo
So has anyone used this combination for realtime and historical log searching
(like what Splunk offers)?

~~~
igrigorik
Yep, take a look at loggly.com - AFAIK, a bunch of ex-Splunk guys. They're
building their system on EC2 + SolrCloud.

~~~
bobf
+1 for loggly -- check out logstash <http://code.google.com/p/logstash/>

~~~
kordless
Be sure to check out Jordan Sissel's Grok as well:
<http://code.google.com/p/semicomplete/wiki/Grok>. It's a field extractor.

~~~
bobf
Definitely. Just about anything Jordan makes is probably worth checking out,
actually.

------
reinhardt
Any experience with how Lucene/Solr stacks up against other search tools such
as Sphinx or Xapian?

~~~
gtani
Not sure if you're asking about indexing speed/size, precision/recall, the
2 or 3 dozen config options (separators/tokenizers/analyzers, stopwords, index
to ASCII or Latin-1, AND/OR search terms), etc.

What I recommend for precision/recall/config options: your platform
(Rails, Django, Java, PHP) probably has a plugin for Solr and Sphinx. Set up
2-4 indexes using the config options that matter most to you (for me they're
AND-OR of search terms, and stopwords, which I use in lists of 0, 50, 100,
150). Then do a (sort of) A-B test where you see which records one index picks
up that the other misses. (Most people recommend not using any stopwords if
you're only using one index, but I never got decent results using only one
index.)
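The A-B comparison itself is just a set difference. A few lines of Python
sketch it; the hit lists below are hypothetical stand-ins for the document IDs
each index configuration actually returns for the same query:

```python
# Toy A-B comparison of two index configurations: which documents does
# each one return, for the same query, that the other misses?
hits_a = {"doc1", "doc2", "doc5"}   # e.g. index built with no stopword list
hits_b = {"doc2", "doc3", "doc5"}   # e.g. index built with a 50-word stoplist

only_a = hits_a - hits_b   # records only config A picks up
only_b = hits_b - hits_a   # records only config B picks up
overlap = hits_a & hits_b  # records both configurations agree on
```

Run this over a batch of representative queries and the `only_a`/`only_b`
buckets tell you which configuration is dropping records you care about.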

P.S. Solr is the 800-pound gorilla: it has the terrific Manning book, zillions
of docs, etc. Sphinx probably covers most people's needs config-option-wise
(at least for European languages), is lightning fast to index, and runs in a
256M VPS with no Tomcat/Jetty.

------
known
I prefer <http://aspseek.org>

