
Observation: Lucene rocks - henning
Two-word summary: Lucene rocks. Nine-word summary: It indexed 3 gigs of text in 20 minutes.

I've wanted to figure out Lucene but never got around to it (the Lucene book is very outdated and none of the example code works, for instance), but today I did something simpler: a little experiment in indexing.

I have a directory of about 3.2 GB of XML documents (medical journal papers downloaded from ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.tar.gz -- it's about a 700 MB file). I wondered how long it would take the simple disk-based Lucene demo to index them using default settings (http://lucene.apache.org/java/2_3_1/demo.html).

System stats: 7200 RPM 300 GB disk; Windows XP SP2; quad-core 2.4 GHz Core 2; 2 GB DDR2-800 RAM.

It took 23 minutes, the last 5 of which were merely flattening index chunks into a single file so that searches run faster.

So about 20 minutes for 3 gigs of text. The final index file was 646 MB, about 1/5 the size of the original source text.

Memory usage was very reasonable - it hovered around 30-40 MB (unlike, say, Java IDEs, which use up 200 MB or so).

Ultimately a benchmark like this is disk-bound, but that's still fast as shit in my opinion. I had to whip together a ghetto homegrown indexing system at work several months ago (I've never had time to optimize it), and this blows away what I created.
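
For anyone who wants the numbers above as throughput, here is a rough back-of-the-envelope sketch in Python. It only uses the figures quoted in the post; the 1 GB = 1024 MB conversion and the variable names are assumptions made for illustration.

```python
# Figures quoted in the post. The GB -> MB conversion (1 GB = 1024 MB)
# is an assumption; the post does not say which convention was used.
SOURCE_MB = 3.2 * 1024   # about 3.2 GB of XML
INDEX_MB = 646           # size of the final index file
TOTAL_MINUTES = 23       # whole run
MERGE_MINUTES = 5        # final merge into a single file


def throughput_mb_per_s(size_mb, minutes):
    """Average throughput in MB/s for size_mb processed in the given minutes."""
    return size_mb / (minutes * 60)


overall = throughput_mb_per_s(SOURCE_MB, TOTAL_MINUTES)
indexing_only = throughput_mb_per_s(SOURCE_MB, TOTAL_MINUTES - MERGE_MINUTES)
index_ratio = INDEX_MB / SOURCE_MB

print(f"overall:       {overall:.2f} MB/s")
print(f"indexing only: {indexing_only:.2f} MB/s")
print(f"index size:    {index_ratio:.1%} of source")
```

Under those assumptions this works out to a little over 2 MB/s for the whole run, about 3 MB/s if the merge step is excluded, and an index just under 20% of the source size.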
======
tlrobinson
Agreed. I was able to set up full text indexing/searching in a few hours.

The longest part of the process was trying to figure out which versions of
Java Lucene and Zend PHP Lucene were compatible. FYI:

 _Lucene 2.1 index format support (which is also used in Lucene 2.2) is
included in the current "trunk" branch. It is available via SVN in current
nightly snapshots.

We hope to include Lucene 2.1 index format support in ZF 1.5.0. The current
release (ZF V1.0.4) works with Lucene 1.9-2.0 index formats._

[http://framework.zend.com/manual/en/zend.search.lucene.html#...](http://framework.zend.com/manual/en/zend.search.lucene.html#ftn.id2740528)

------
henning
And the biggest difference between my ghetto system and Lucene is that
searches with lots of results are very, very fast.

------
thorax
I've heard good things about Lucene, but we use Sphinx:
<http://www.sphinxsearch.com/>

For our tests, it indexed much faster than the common Lucene implementations,
and for our needs was also a tad faster overall. I haven't tried the newest
version, though.

~~~
nickb
I don't know what kind of testing you've done but nothing even approaches the
speed of Lucene. It's by far the fastest open source search engine currently
available. If you're using Rails, I cannot recommend Solr enough. It's
amazing.

Cutting's a genius.

~~~
thorax
Do you have references for "nothing even approaches"? Specifically compared to
Sphinx? The only comparisons I've found show Sphinx coming out ahead in
many indexing/search cases (if only slightly). See my other comment on this
thread with links to benchmarks where Sphinx clearly "comes close". We did a
good bit of research on this, so it does feel odd that you'd say "nothing even
approaches".

It was also ridiculously easy to get Sphinx up and going. Lucene is a killer
engine, no doubt, but Sphinx's ROI alone won us over.

------
azsromej
I second your observation, though my recent foray into Lucene was far simpler.
I used the RAMDirectory feature to build an index in memory for a large list
of names (and our queries go through a thick OR/M). The user of the
application needs to be able to filter the list by keywords and doing the
query each time was taking too long (2 or 3 seconds). It's now near
instantaneous.

I think for 10,000 documents (two fields: name and id) it takes 20 seconds to
build the index in Lucene .NET.

I had always heard of using Lucene for really large datasets and thought it
might be overkill for speeding up a somewhat small part of one application
dialog. In reality it took a single reference to the Lucene .NET dll and a few
functions to build the documents and add them to the index.
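
To illustrate why an in-memory index makes this kind of keyword filter near instantaneous, here is a minimal sketch of an inverted index over (id, name) documents. It is written in plain Python and is not the Lucene or Lucene .NET API; the class name, method names, and sample data are all hypothetical, and it leaves out everything a real engine adds (analyzers, ranking, prefix queries).

```python
import re
from collections import defaultdict


class InMemoryNameIndex:
    """Toy inverted index: token -> set of document ids (hypothetical helper)."""

    def __init__(self):
        self._postings = defaultdict(set)  # token -> ids of docs containing it
        self._names = {}                   # id -> original name

    @staticmethod
    def _tokenize(text):
        # Lowercase and split on anything that is not a letter or digit.
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, doc_id, name):
        """Index one document with two fields: id and name."""
        self._names[doc_id] = name
        for token in self._tokenize(name):
            self._postings[token].add(doc_id)

    def search(self, query):
        """Return (id, name) pairs whose name contains every query keyword."""
        tokens = self._tokenize(query)
        if not tokens:
            return []
        # .get avoids creating empty postings for unknown tokens.
        posting_sets = [self._postings.get(token, set()) for token in tokens]
        matching_ids = set.intersection(*posting_sets)
        return sorted((doc_id, self._names[doc_id]) for doc_id in matching_ids)


index = InMemoryNameIndex()
index.add(1, "Ada Lovelace")
index.add(2, "Ada Yonath")
index.add(3, "Grace Hopper")

print(index.search("ada"))         # both Adas
print(index.search("ada yonath"))  # only id 2
print(index.search("nobody"))      # no matches
```

The point of the sketch is that each lookup is a dictionary access plus a set intersection, with no round trip through the OR/M or the database per keystroke.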

------
asjo
Has anyone compared Lucene to Xapian? I have never tried Lucene, but have been
very happy with Xapian.

<http://xapian.org/>

------
initself
Plucene - Perl port of Lucene

<http://search.cpan.org/~tmtm/Plucene-1.21/lib/Plucene.pm>

~~~
bluelu
Plucene is slow as hell.

You'd better use KinoSearch, which also uses the same index format as Lucene.

Some benchmarks are on this site :
<http://marvinhumphrey.com/kinosearch/benchmarks.html>

------
chaostheory
Lucene is pretty cool and it's a lot better than anything I've seen so far
(including Ferret). The only problem I've experienced with it was index
corruption, which is fairly common and frustrating (though in fairness it
could have been due to my sysadmin skills).

