Lucene: The Good Parts (parsely.com)
172 points by pixelmonkey on Mar 13, 2015 | 16 comments

Lucene is quite fantastic and Elasticsearch makes it a joy to use.

Still, I wonder how much overhead Java is adding in this case. Even minor things like integer decoding can be done very fast with SIMD... but such approaches don't seem amenable to Java. I see that Elasticsearch exposes quite a few GC metrics, which suggests GC must be a problem at times. And one of the Lucene devs wrote a post on how he replaced some parts with C++ and saw massive gains (but with a disclaimer that this was in no way indicative that Java wasn't fast).
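For anyone wondering what "integer decoding" means here: search indexes commonly store posting lists as gaps between sorted document ids, packed into variable-length bytes. Below is a minimal sketch of that idea, assuming a 7-bits-per-byte layout with the high bit as a continuation flag. The function names are made up for illustration, and this is not Lucene's actual code or on-disk format. The byte-at-a-time loop in the decoder is the part that SIMD implementations try to speed up.

```python
def encode_varint(n):
    """Encode a non-negative integer, 7 bits per byte, low-order group first."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
        n >>= 7
    out.append(n)  # final byte has the high bit clear
    return bytes(out)


def decode_varints(data):
    """Decode a byte string holding zero or more concatenated varints."""
    values = []
    current = 0
    shift = 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # continuation: keep accumulating
        else:
            values.append(current)
            current = 0
            shift = 0
    if shift:
        raise ValueError("truncated varint at end of input")
    return values


def encode_postings(doc_ids):
    """Delta-encode a sorted list of doc ids, then varint-pack the gaps."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        out += encode_varint(doc_id - prev)
        prev = doc_id
    return bytes(out)


def decode_postings(data):
    """Inverse of encode_postings: unpack the gaps and re-accumulate them."""
    doc_ids = []
    prev = 0
    for gap in decode_varints(data):
        prev += gap
        doc_ids.append(prev)
    return doc_ids
```

Small gaps fit in a single byte, which is why sorting and delta-encoding the ids first pays off.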

I've considered trying to implement something like Lucene in, say, Rust, but then I see just how utterly massive Lucene is. Just the fuzzy search part alone required implementing code to generate code from a Russian PhD thesis they didn't fully understand.[1] So, no matter how many cycles the JVM is needlessly burning, Lucene just seems too advanced to rewrite without that overhead. (And maybe my intuition is just wrong and the overhead is only a few percent.)

1: http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is...

There is a C++ port of Lucene, CLucene [1], which is compatible with version 2.3 of Java Lucene. The project stopped a long time ago, but it is very stable and works perfectly. Another port, compatible with version 3 of Java Lucene, is LucenePlusPlus [2], but it uses a lot of Boost smart pointers, and the port looks like it was automated. This port is why CLucene development stopped: the maintainers wanted to make the new port faster by avoiding smart pointers wherever possible, but that didn't happen.

1: http://sourceforge.net/projects/clucene

2: https://github.com/luceneplusplus/LucenePlusPlus

Oddly enough, I don't see anyone talking about benchmarks for those projects. I found one offhand comment saying it was 2-3x faster than Java for indexing, but only 10% better for search. No real benchmarks or such. I suppose speed is not the only reason to want a non-JVM version, but it seems like a pretty major one and something that'd warrant headline treatment in the readme...

Java overhead is not a huge issue unless you are embedding Lucene into low-spec devices. A consumer search system usually relies heavily on caching (just like any database), so even a 30-50% latency hit on cold queries is not that big a deal if > 90% of your queries are served from cache.

GC is a big problem when you don't know the expected query distribution, which is the case for Elasticsearch's analytics. There is a lot more to a search engine than packing, decoding and merging posting lists. I've never seen anything that compares with what Lucene's text analysis and scoring APIs support.

That fuzzy search story is more worrying than anything else, really. What they did seems just crazy to me.

First, the paper really doesn't seem so difficult. Second, they don't even think about reaching out to the author/s?

I suppose I shouldn't talk until I've tried their task. But I've implemented a lot of algorithms from papers, and their story had me shaking my head.

I've also implemented a lot of algorithms from papers, and their story has me nodding my head in agreement. Some algorithms papers are just downright impenetrable.

I tried implementing the same paper's algorithm in the past and somewhat succeeded, but gave up in the end. Precomputing the automata was slow as hell, I came to a similar conclusion as the authors (N > 2 isn't really feasible, but was something I was interested in), and my plumbing sucked.

I'm really not experienced at reading papers and this was the only one I ever tried, so I cannot compare it to others. It certainly was quite hard for me to follow and took some months of nightly dabbling before I reached the point above.
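For context on what the fuzzy matching is checking, assuming N above means the maximum edit distance: a term matches if it can be turned into the query term with at most N single-character inserts, deletes or substitutions. Here is a minimal sketch of that test using the textbook dynamic-programming recurrence. It is deliberately not the automaton construction from the paper, the function name is made up, and it compares one pair of strings at a time, which is the cost the automaton approach avoids when scanning a whole term dictionary.

```python
def within_edits(a, b, max_edits):
    """Return True if the Levenshtein distance between a and b is <= max_edits."""
    if abs(len(a) - len(b)) > max_edits:
        return False  # the length difference alone already needs too many edits
    # prev[j] holds the distance between the first i-1 chars of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute or match
        if min(cur) > max_edits:
            return False  # row minima never decrease, so no later row can recover
        prev = cur
    return prev[-1] <= max_edits
```

This runs in roughly len(a) * len(b) steps per pair of strings.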

Is there a self-contained alternative to ElasticSearch specifically? If there was one written in Go or otherwise statically linkable that would be great from a deployment standpoint. I could deal with somewhat worse performance in exchange for that.

Search for "golang full text search database".

Lucene & Hadoop meant a big push for the Java ecosystem; it's like a lock-in. Native C++ libraries and other free text search implementations have a smaller community and are usually less well known. With Go, C++11 and Rust the future looks bright, but it will take some time to catch up.

Agree, it's early days for non-Java based alternatives.

One of my colleagues, Marty Schoch, has been working on a full text search engine in golang, called bleve [1]

1: http://www.blevesearch.com/

>Search "golang full text search database".

I have. The problem is picking one that is mature enough and will be supported for years as you can expect ElasticSearch to be, which is what I meant by "an alternative".

I agree with you about the future looking bright but I meant something you could use right now.

It's hard to say. For Go there is e.g. bleve FTS, and there are ports of Java Lucene to Go (e.g. https://github.com/balzaczyy/golucene). Such ports are either semi-automatic or fully automatic. It's hard for Lucene ports to keep up, as Lucene is moving fast, and most ports have stalled.

One could also use a service-oriented architecture and use e.g. the ElasticSearch REST API or the C++ based Sphinx Search; both need little configuration and no custom code.

Most devs don't know that there is CLucene, a C++ port of the Java based Lucene. It's lacking devs, so it's some versions behind. An alternative to CLucene that is also native C++ is Sphinx Search (Sphinx is to CLucene what Nginx is to Apache). SQLite also has an official full text search addon named FTS4.

Lucene in Action 1st and 2nd edition are great books; I have them both. The first edition was like the missing manual, and it covers the API of Lucene 1.4 with its rather rough API objects. The Lucene 2.x+ API improved a lot.

There is also LucenePlusPlus: https://github.com/luceneplusplus/LucenePlusPlus

Great article. I've rolled my own full-text search engines in the past and it's a category of problems that I love, but even I have to admit that I'm often astounded by Lucene's performance. The inverted index really lets you stretch commodity hardware into pretty huge use-cases.
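For anyone who hasn't built one, the core idea of an inverted index fits in a few lines. This is a toy sketch under heavy assumptions (whitespace tokenization, lowercasing only, everything in memory, AND-only queries). The class and method names are made up for illustration and have nothing to do with Lucene's API, which adds analysis, scoring, compression and on-disk segments on top.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy in-memory index: maps each term to the ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(list)  # term -> ascending list of doc ids
        self.next_doc_id = 0

    def add(self, text):
        """Index one document and return the id assigned to it."""
        doc_id = self.next_doc_id
        self.next_doc_id += 1
        # set() so a term repeated within one document is recorded only once
        for term in set(text.lower().split()):
            self.postings[term].append(doc_id)
        return doc_id

    def search(self, query):
        """Return sorted ids of documents containing every term in the query."""
        terms = query.lower().split()
        if not terms:
            return []
        result = None
        for term in terms:
            docs = set(self.postings.get(term, ()))
            result = docs if result is None else result & docs
        return sorted(result)
```

A query only touches the posting lists for its own terms, never the documents themselves, which is why lookups stay cheap as the collection grows.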

If you've never used ElasticSearch, I should note that that's one of ES's many strengths -- it takes advantage of Lucene and makes deployments on commodity hardware work really well. An ES cluster on five small EC2 instances can handle a tremendous workload.

There is one thing about ES/Lucene that bugs me though... in the 3+ years I've been running it in production, I still haven't been able to solve the "every once in a while Java utilizes 100% CPU until you restart the service" issue. I suspect it has to do with Lucene's index merge operation, but no amount of tinkering has solved the problem.

One of the boons of Java is its remote debugging support: you can attach a profiler to a process when something like this happens, extract thread names & stacks, and so on.

AFAIK you can also use the Linux 'perf trace' command on a Java process, but probably there is some more setup involved.
