
Lucene: The Good Parts - pixelmonkey
http://blog.parsely.com/post/1691/lucene/
======
MichaelGG
Lucene is quite fantastic and Elasticsearch makes it a joy to use.

Still, I wonder how much overhead Java adds in this case. Even minor
things like integer decoding can be done very fast with SIMD... but such
approaches don't seem amenable to Java. I see that Elasticsearch exposes quite
a few GC metrics, which suggests GC is a problem at times. And one of the
Lucene devs wrote a post on how he replaced some parts with C++ and saw
massive gains (but with a disclaimer that this was in no way indicative that
Java wasn't fast).
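
For anyone unsure what "integer decoding" refers to here: a rough sketch of
variable-byte integer coding, the kind of scheme search indexes commonly use
for postings. This is a plain scalar version in Python, not SIMD, and the
names `encode_vint` and `decode_vint` are made up for illustration. Whether
any given Lucene version stores its postings exactly this way is an
assumption.

```python
from typing import Tuple


def encode_vint(value: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low-order bits first."""
    if value < 0:
        raise ValueError("value must be non-negative")
    out = bytearray()
    while value >= 0x80:
        # High bit set means "more bytes follow".
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def decode_vint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one integer starting at pos; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, pos
        shift += 7
```

The data-dependent loop in `decode_vint` is the part that SIMD decoders try
to avoid by handling many integers per instruction.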

I've considered trying to implement something like Lucene in, say, Rust, but
then I see just how utterly massive Lucene is. The fuzzy search part alone
required implementing code to generate code from a Russian PhD thesis they
didn't fully understand.[1] So, no matter how many cycles the JVM is
needlessly burning, Lucene just seems too advanced to rewrite without that
overhead. (And maybe my intuition is just wrong and the overhead is only a few
percent.)

1: [http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is...](http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html)
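
To make "fuzzy search" concrete: a naive sketch that matches terms within a
small edit distance using the textbook dynamic-programming algorithm. This is
not the automaton-based approach from the linked post; it scans every term,
which is exactly the cost a smarter approach would try to avoid. The function
names and the sample terms are invented for illustration.

```python
from typing import Iterable, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def fuzzy_match(query: str, terms: Iterable[str], max_edits: int = 2) -> List[str]:
    """Return every term within max_edits of the query (linear scan)."""
    return [t for t in terms if edit_distance(query, t) <= max_edits]
```

Usage would look like `fuzzy_match("lucene", ["lucene", "lucine", "solr"], 1)`,
which keeps the first two terms and drops the third.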

~~~
vdfs
There is a port of Lucene to C++, CLucene[1]. It is compatible with version
2.3 of Java Lucene; the project stopped a long time ago, but it is very
stable and works perfectly. Another port, compatible with version 3 of Java
Lucene, is LucenePlusPlus[2], but it uses a lot of Boost smart pointers, and
the port seems like it was automated. This port is why CLucene development
stopped: the maintainers wanted to make the new port faster by avoiding
smart pointers wherever possible, but that didn't happen.

1:
[http://sourceforge.net/projects/clucene](http://sourceforge.net/projects/clucene)
2:
[https://github.com/luceneplusplus/LucenePlusPlus](https://github.com/luceneplusplus/LucenePlusPlus)

~~~
MichaelGG
Oddly enough, I don't see anyone talking about benchmarks for those projects.
I found one offhand comment saying it was 2-3x faster than Java for indexing,
but only 10% better for search. No real benchmarks or such. I suppose that's
not the only reason to want a non-JVM version but it seems like a pretty major
reason and something that'd warrant headline treatment in the readme...

------
bkanber
Great article. I've rolled my own full-text search engines in the past and
it's a category of problems that I love, but even I have to admit that I'm
often astounded by Lucene's performance. The inverted index really lets you
stretch commodity hardware into pretty huge use-cases.
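
For readers who haven't built one, a minimal sketch of the inverted index
idea: map each term to the set of documents containing it, then answer a
query by intersecting those sets. It assumes whitespace tokenization and AND
semantics, and `build_index` and `search` are illustrative names only. Real
engines add analysis, scoring, compression, and on-disk segments on top of
this.

```python
from collections import defaultdict
from typing import Dict, Set


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each lowercased term to the set of doc ids that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """Return the doc ids that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # .get avoids creating empty entries for unknown terms.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)
```

The lookup cost depends on the size of the postings for the query terms, not
on the total size of the corpus, which is where most of the speed comes from.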

If you've never used ElasticSearch, I should note that that's one of ES's many
strengths -- it takes advantage of Lucene and makes deployments on commodity
hardware work really well. An ES cluster on five small EC2 instances can
handle a tremendous workload.

There is one thing about ES/Lucene that bugs me though: in the 3+ years I've
been running it in production, I still haven't been able to solve the "every
once in a while Java uses 100% CPU until you restart the service" issue. I
suspect it has to do with Lucene's index merge operation, but no amount of
tinkering has solved the problem.

~~~
_wmd
One of the boons of Java is its remote debugging support: you can attach a
profiler to a process when something like this happens, extract thread names
and stacks, and so on.

AFAIK you can also use the Linux 'perf trace' command on a Java process,
though there is probably some more setup involved.

------
frik
Most devs don't know that there is CLucene, a C++ port of the Java-based
Lucene. It's lacking devs, so it's some versions behind. An alternative to
CLucene that is also native C++ is Sphinx search (it is to CLucene what Nginx
is to Apache). SQLite also has an official full-text search add-on named FTS4.
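
A quick sketch of trying FTS4 from Python's built-in `sqlite3` module. It
assumes the SQLite library your Python links against was compiled with FTS4
enabled, which is common but not guaranteed, so the function returns `None`
if the virtual table can't be created. The table name, column name, and
sample rows are made up for illustration.

```python
import sqlite3
from typing import List, Optional


def fts4_demo(query: str = "lucene") -> Optional[List[int]]:
    """Return rowids of rows matching the query, or None if FTS4 is missing."""
    conn = sqlite3.connect(":memory:")
    try:
        try:
            conn.execute("CREATE VIRTUAL TABLE docs USING fts4(body)")
        except sqlite3.OperationalError:
            return None  # this SQLite build has no FTS4 module
        conn.executemany(
            "INSERT INTO docs(body) VALUES (?)",
            [
                ("Lucene is a search library",),
                ("CLucene is a C++ port of Lucene",),
                ("Sphinx is another search engine",),
            ],
        )
        rows = conn.execute(
            "SELECT rowid FROM docs WHERE body MATCH ? ORDER BY rowid",
            (query,),
        ).fetchall()
        return [row[0] for row in rows]
    finally:
        conn.close()
```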

The 1st and 2nd editions of Lucene in Action are great books; I have them
both. The first edition was like the missing manual, and it covers Lucene 1.4
with its rather rough API objects. The API improved a lot from Lucene 2.x on.

~~~
vdfs
There is also LucenePlusPlus:
[https://github.com/luceneplusplus/LucenePlusPlus](https://github.com/luceneplusplus/LucenePlusPlus)

