Still, I wonder how much overhead Java is adding in this case. Even minor things like integer decoding can be done very fast with SIMD... but such approaches don't seem amenable to Java. I see that Elasticsearch exposes quite a few GC metrics, so GC must be a problem at times. And one of the Lucene devs wrote a post on how he replaced some parts with C++ and saw massive gains (but with a disclaimer that this was in no way indicative that Java wasn't fast).
I've considered trying to implement something like Lucene in, say, Rust, but then I see just how utterly massive Lucene is. Just the fuzzy search part alone required implementing code to generate code from a Russian PhD thesis they didn't fully understand. So, no matter how many cycles the JVM is needlessly burning, Lucene just seems too advanced to rewrite without that overhead. (And maybe my intuition is just wrong and the overhead is only a few percent.)
GC is a big problem when you don't know the expected query distribution, which is the case for Elasticsearch's analytics. There is a lot more to a search engine than packing, decoding and merging posting lists. I've never seen anything that compares with what Lucene's text analysis and scoring APIs support.
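For readers who haven't seen what "packing, decoding and merging posting lists" looks like, here is a minimal Java sketch. It is not Lucene's actual code or format: the class name PostingListSketch and its methods are made up for illustration, the encoding is a plain delta plus variable-byte scheme, and "merging" is shown as an AND-style intersection of two sorted lists. It assumes doc IDs are non-negative and strictly ascending.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical illustration only; not Lucene's on-disk format or API.
public class PostingListSketch {

    // Pack sorted doc IDs as deltas, each delta written 7 bits per byte.
    // The high bit of a byte is set when more bytes of the same delta follow.
    static byte[] encode(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int docId : sortedDocIds) {
            int delta = docId - previous;
            previous = docId;
            while ((delta & ~0x7F) != 0) {
                out.write((delta & 0x7F) | 0x80);
                delta >>>= 7;
            }
            out.write(delta);
        }
        return out.toByteArray();
    }

    // Reverse of encode: rebuild each delta, then prefix-sum back to doc IDs.
    static int[] decode(byte[] bytes) {
        int[] docIds = new int[bytes.length]; // at most one doc ID per byte
        int count = 0;
        int docId = 0;
        int pos = 0;
        while (pos < bytes.length) {
            int delta = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            docId += delta;
            docIds[count++] = docId;
        }
        return Arrays.copyOf(docIds, count);
    }

    // AND-style merge: walk both sorted lists once and keep the common doc IDs.
    static int[] intersect(int[] a, int[] b) {
        int[] result = new int[Math.min(a.length, b.length)];
        int i = 0;
        int j = 0;
        int n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;
            } else if (a[i] > b[j]) {
                j++;
            } else {
                result[n++] = a[i];
                i++;
                j++;
            }
        }
        return Arrays.copyOf(result, n);
    }

    public static void main(String[] args) {
        byte[] packed = encode(new int[] {3, 7, 20, 300});
        int[] first = decode(packed);
        int[] second = {7, 8, 300, 301};
        System.out.println(packed.length + " bytes for " + first.length + " doc IDs");
        System.out.println(Arrays.toString(intersect(first, second)));
    }
}
```

The byte-at-a-time loop in decode is the kind of code that SIMD implementations speed up by decoding several integers per instruction. The sketch only shows the scalar baseline.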
First, the paper really doesn't seem so difficult. Second, they didn't even think about reaching out to the author(s)?
I suppose I shouldn't talk until I've tried their task. But I've implemented a lot of algorithms from papers, and their story had me shaking my head.
I'm really not experienced at reading papers and this was the only one I ever tried, so I cannot compare it to others. It certainly was quite hard for me to follow and took some months of nightly dabbling before I reached the point above.
Lucene & Hadoop meant a big push for the Java ecosystem; it's like a lock-in. Native C++ libraries and other free text search implementations have a smaller community and are usually less known. With Go, C++11 and Rust the future looks bright, but it will take some time to catch up.
One of my colleagues, Marty Schoch, has been working on a full text search engine in golang, called bleve.
I have. The problem is picking one that is mature enough and will be supported for years as you can expect ElasticSearch to be, which is what I meant by "an alternative".
I agree with you about the future looking bright but I meant something you could use right now.
One could also use a service-oriented architecture and use e.g. the ElasticSearch REST API or the C++ based Sphinx Search. Both need little configuration and no custom code.
Lucene in Action 1st and 2nd edition are great books, I have them both. The first edition was like the missing manual and it covers the Lucene 1.4 API with its rather rough API objects. The API improved a lot in Lucene 2.x and later.
If you've never used ElasticSearch, I should note that that's one of ES's many strengths -- it takes advantage of Lucene and makes deployments on commodity hardware work really well. An ES cluster on five small EC2 instances can handle a tremendous workload.
There is one thing about ES/Lucene that bugs me though... in the 3+ years I've been running it in production, I still haven't been able to solve the "every once in a while Java utilizes 100% CPU until you restart the service" issue. I suspect it has to do with Lucene's index merge operation, but no amount of tinkering has solved the problem.
AFAIK you can also use the Linux 'perf trace' command on a Java process, but probably there is some more setup involved.
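For the runaway-CPU issue above, the JVM can also report per-thread CPU time itself, which may help confirm or rule out the merge-thread suspicion. Below is a minimal in-process sketch using the standard java.lang.management API. The class name HotThreadSketch and the method busiestThread are made up for illustration, and it only inspects the JVM it runs in; looking at a separate Elasticsearch process this way would need a remote JMX connection, which is not shown. Elasticsearch also has a hot threads endpoint (GET _nodes/hot_threads) that is worth trying first.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration: samples CPU time of threads in the current JVM only.
public class HotThreadSketch {

    // Returns the ID of the thread that used the most CPU time during the
    // sampling window, or -1 if CPU time measurement is unavailable.
    static long busiestThread(long windowMillis) throws InterruptedException {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        if (!threads.isThreadCpuTimeSupported() || !threads.isThreadCpuTimeEnabled()) {
            return -1;
        }
        // First sample: CPU nanoseconds consumed so far by each live thread.
        Map<Long, Long> before = new HashMap<>();
        for (long id : threads.getAllThreadIds()) {
            before.put(id, threads.getThreadCpuTime(id));
        }
        Thread.sleep(windowMillis);
        // Second sample: keep the thread with the largest increase.
        long hottest = -1;
        long maxDelta = -1;
        for (long id : threads.getAllThreadIds()) {
            long now = threads.getThreadCpuTime(id);
            Long then = before.get(id);
            if (now < 0 || then == null || then < 0) {
                continue; // thread died, is new, or has no CPU time available
            }
            if (now - then > maxDelta) {
                maxDelta = now - then;
                hottest = id;
            }
        }
        return hottest;
    }

    public static void main(String[] args) throws InterruptedException {
        long id = busiestThread(1000);
        if (id < 0) {
            System.out.println("Thread CPU time is not available on this JVM");
            return;
        }
        // Fetch the thread's name and up to 20 stack frames.
        ThreadInfo info = ManagementFactory.getThreadMXBean().getThreadInfo(id, 20);
        if (info == null) {
            System.out.println("Thread " + id + " has already exited");
            return;
        }
        System.out.println("Busiest thread: " + info.getThreadName());
        for (StackTraceElement frame : info.getStackTrace()) {
            System.out.println("    at " + frame);
        }
    }
}
```

If the busiest thread's stack shows merge code during a spike, that supports the index-merge theory; if it shows something else, the stack at least says where to look next.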