

Lucene's FuzzyQuery is 100 times faster in 4.0 - amnigos
http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html

======
mgkimsal
This is not to pick on Lucene in particular - not 100% sure it even applies -
but I'm always a bit conflicted when massive speed improvements take place in
projects. It tends to validate earlier criticisms about the speed of a
particular project which were dismissed (and often countered) by community
members.

"Java is slow." "No, it's not - it's hella fast!" "But it takes 4 minutes to
launch every Java app I ever use." "Dude, you're doing it wrong - everything I
write and use in Java is so fast I have to sometimes wonder if Java didn't
magically upgrade my CPU!" etc.

Then a new JVM will come out with measurably faster performance, which
validates the earlier view that something was indeed _slow_ , but it's
ignored/overlooked/glossed over by proponents.

The other response often is "the code's open, fix it yourself", which
generally does no one any good. Just because I'm qualified to determine that
something is _slow_ doesn't mean I have the foggiest clue how to fix it. Nor
am I encouraged that you'll actually take my patches back (even if I could
write them) because you (project team) don't seem to think there's a speed
problem in the first place.

This isn't to rag on Lucene's team - I've no idea if this applies to them in
particular - it's just something that popped into my head when I read the
title.

The article itself is interesting, describing how the speedup was implemented
starting with Python - I won't spoil the rest of the story :)

~~~
m0th87
Java _is_ fast for long-running applications. Once the JIT kicks in, important
code is natively compiled. But it _does_ take a long time to start. This is
why Java is big in the server space, and not so much for desktop applications.

The same holds true for Lucene. It works wonderfully well for most queries.
These optimizations are strictly for fuzzy queries, which no one ever used, at
least in my deployment.

This is the absurdity of performance measurement. Everyone wants to quantify
it into a single score. But doing so is like making a 100m track sprinter race
a cross-country runner. Wtf are your criteria?

~~~
mgkimsal
I was using it as an example - KDE could have been another, or MySQL, or
whatever.

The criteria in the Java example - conversations I've had with people directly
- were launching and running a desktop app. Things like clicking a menu option
and being able to watch the menu draw (< 1 second, but still noticeable). I've
sat down with a couple of die-hard Java guys years ago and finally _showed_
them the slowness I would complain about. On my machine - click X, watch Y
take a long time, etc.

"Oh, that's fine! What are you complaining about?"

"Well, when I run a 'native' app, I don't notice any of these slow
operations."

"Oh, it's fine - that's hardly noticeable at all! Why do you care? Java's
fine!" and so on.

So... often judgements really are in the eye of the beholder, regardless of
what numerical benchmarks show.

------
pragmatic
Do users actually need to enter this syntax:

 _The QueryParser syntax is term~ or term~N_?

After considering Lucene, I built my in-house search engine. I wanted it to
work a lot more like Google than a "full-text" library-style search engine.

Very few users will go beyond the basics. How many users actually use
Google's advanced search? Even programmers? Very few. Users don't use advanced
search features.

Why? It's not their fault. They tried. They tried at the library - didn't
work. They tried on the early search engines - didn't work. They tried on your
intranet app powered by database full-text search - it didn't work.

We trained them. We showed them that (most) advanced searching isn't worth
their time.

What is worth their time? Revision + speed. Make it easy to try different
combos of search terms really fast. Correct spelling, suggest searches, add
the ability to filter information.

See: _Information Architecture for the World Wide Web_ p. 185.

Also available on Safari Books.

See Also: Google

~~~
techtalsky
We recently chose a Solr/Lucene solution for our game search at Big Fish Games
(launches tomorrow!). I cannot imagine writing from scratch many features that
come with Solr/Lucene: spelling correction, word stemming (walk == walking),
stop words (don't return every hit for "the"), sane defaults for tokenizing
(splitting up sentences into indexable and searchable chunks, like words), and
uhh soon Fuzzy Query matching.

It's certainly easy to use a very limited subset of Lucene's capabilities to
come up with a very intuitive user-searchable index of data.

If you wrote something from scratch that truly works better than an out of the
box Solr server, let's just say I'd be surprised.

~~~
stonemetal
If you can't imagine writing a spelling corrector from scratch you might want
to take a minute and broaden your horizons (I know it was magic pixie dust for
me before I read it). <http://norvig.com/spell-correct.html>
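
Norvig's corrector really is compact. A minimal sketch of its core idea -
generate every string one edit away from the input, keep the ones found in a
known-word set - where the tiny WORDS set is just a stand-in for a real corpus,
not his code verbatim:

```python
import string

WORDS = {"spelling", "corrected", "fuzzy", "query"}

def edits1(word):
    """All strings at Levenshtein distance 1 from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Return a known word at distance <= 1, preferring the word itself."""
    if word in WORDS:
        return word
    candidates = edits1(word) & WORDS
    return min(candidates) if candidates else word

print(correct("speling"))  # "spelling"
```

A real corrector would rank candidates by word frequency instead of the
arbitrary tie-break used here.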

Stemming could be accomplished just as easily: put in a word, get back a list
of stems. It's just a dictionary lookup that could be precomputed.
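
That precomputed-lookup idea is tiny in code; the table entries below are
illustrative, not a real morphological dictionary:

```python
# Precomputed inflected-form -> stem table; stemming becomes O(1) per word.
STEMS = {
    "walk": "walk",
    "walks": "walk",
    "walked": "walk",
    "walking": "walk",
    "ran": "run",
    "running": "run",
}

def stem(word):
    # Fall back to the word itself when it is not in the table.
    return STEMS.get(word, word)

print(stem("walking"))  # "walk"
```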

~~~
awj
I thought his comment was more along the lines of "I cannot imagine writing
industrial-strength versions of _all_ of these things Lucene gives me."

Yes, you can write a spellchecker in 21 lines of code, but that doesn't
necessarily mean it will be fast enough to be a component in website search
_or_ that it will be any kind of a pleasure to query or maintain the word
corpus.

I can put together toy versions of many things Lucene provides pretty easily
in my own time. Building useful, dependable versions of most _anything_ takes
a nontrivial amount of time and effort, so it's smart to restrict my usage of
toys to understanding the concepts.

------
kilburn
My short summary of what the article is about...

Lucene's previous approach to fuzzy matching was to check the distance between
the input word and every word in the dictionary. Let's assume that their
implementation to compute the distance between two words was linear-time on
the size of the largest word being compared. Then, the complexity of this
method is _O(nm)_ , where _n_ is the average length of all words and _m_ is
the number of words in the dictionary.
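
The brute-force approach described above can be sketched like this - a plain
dynamic-programming Levenshtein distance scanned over the whole dictionary.
The names and cutoff are illustrative, not Lucene's actual code:

```python
def levenshtein(a, b):
    # Classic DP edit distance, computed one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free on a match)
            ))
        prev = cur
    return prev[-1]

def fuzzy_match(query, dictionary, d=2):
    # The old approach: compare the query against *every* word.
    return [w for w in dictionary if levenshtein(query, w) <= d]

words = ["lucene", "lucent", "license", "query"]
print(fuzzy_match("lucene", words, d=1))  # ['lucene', 'lucent']
```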

The new algorithm uses some precomputed tables that define a parametrized
deterministic finite state machine. Then, when the user inputs a word, the
word is used to fix the parameters (that is, setting the rules on how to
traverse the states defined in the precomputed tables).

From the given information, it is unclear to me what the complexity of this
parameter-fixing step is, but the paper states that it's linear, so we'll say
_O(m)_. Then, discovering _all_ the words in the dictionary that are within a
fixed distance "d" has a complexity of _O(k+m)_ , where _k_ is the length of
the longest word in the dictionary.

Complexity-wise, it is very clear that you can gain an X-fold increase in
performance by moving from one algorithm to the other, simply by growing the
dictionary until you get to your desired X.

Finally, I really dislike the author stating that "Unfortunately, the paper
was nearly unintelligible!". I've taken a quick look at it, and it is
immediately clear that the authors put a great deal of effort into it.
Further, it looks quite packed, but perfectly organized and written in a
clear-enough language...

~~~
xyzzyz
_Finally, I really dislike the author stating that "Unfortunately, the paper
was nearly unintelligible!". I've taken a quick look at it, and it is
immediately clear that the authors put a great deal of effort into it.
Further, it looks quite packed, but perfectly organized and written in a
clear-enough language..._

It does not matter how much effort is put in or how clearly a paper is
written; to someone without the proper background it will always sound
unintelligible. Put yourself in the position of a person who never studied
formal language theory and try to read it. It is utterly impossible. I am not
saying that this is a bad thing (had researchers needed to put everything in
layman's terms, they would not have much time left for actual research), it is
just understandable.

~~~
kilburn
Yeah, I totally understand that the paper must be very hard to grasp for
someone without a minimal background in formal languages.

However, the author said that it is unintelligible, implying that _nobody_ can
understand it, and then supports this idea by elevating the guy who actually
understood and implemented it in his spare time to wizard status.

Discrediting someone's work that much is plain wrong as I see it, even more so
if you are using it to claim "100x speedups" in your project.

------
rivalis
On the one hand, this does not inspire confidence. It is disturbing to have
magic in one's software. On the other hand, the speed gains are really
impressive: I'm sure there are times when it is reasonable to make a magic vs.
utility tradeoff. Also, it's OSS: I'm sure someone will eventually want to
make a well-understood and documented version, and the devs seem like people
who would be willing to accept that.

~~~
conover
I completely agree. There is, of course, always a balance to be struck but
it's pretty difficult to ignore a 100x speed-up in something non-trivial.

Also, I read part of the paper linked in the post. It doesn't seem completely
inaccessible, just dense. I'm sure someone will eventually come up with a non-
hacky version.

------
mhp
I'm torn between being happy that Lucene is getting a giant speed boost and my
concern that the code is doing something magical which the programmers don't
understand. If there's a bug and it has to do with the algorithm, how will
they fix it?

~~~
afsina
The way they did it is exactly what you feared: they took complex Python code
and converted it to Java using a converter tool, AFAIK.

~~~
andrewcooke
i'm amazed by the text in that post. in general i appreciate people admitting
when they don't understand something, but the tone there goes beyond relaxed
to, well, a celebration of ignorance (greek letters! oh noes!). is it a joke?

~~~
vmind
It's also confusing that they don't mention whether they even contacted the
authors of the paper to see if an implementation (even partial) had been made,
or whether clarification could be provided on implementation details.

------
aksbhat
Interesting article!

A lot of commenters here are scared of using magic code.

However, note that Search/Information Retrieval is a hard problem: developing
a generalized full-text search engine is difficult.

Testing search algorithms is even more difficult; e.g., NIST organizes the
TREC conference <http://trec.nist.gov/pubs/call2011.html>, in which a major
emphasis is on evaluation of search algorithms.

In fact, Search is, as my advisor calls it, an AI-complete problem: creating a
perfect search engine would amount to creating a human-like artificial
intelligence capable of understanding your query and the corpus.

------
markrmiller
Heh - Mike was exaggerating when he said unintelligible. While we are not
masters of that paper, we worked through it and understood the algorithm. We
could take a simple example and apply the steps - a very different level of
understanding from that of someone who studies and focuses on this field, yes.
Given time, we could have done the implementation without the Python code -
Mike's recollection of the early part of this story is heavily second-hand.

However, even understanding the algorithm (if not masters of all the concepts
behind it), there was still a large gap to implementation. The solution that
was used allowed us to focus on pieces of that problem and accelerate
development fantastically.

Lucene has some of the best tests in open source software IMO. We are
confident in this code - whether it takes 2 or 3 people to properly maintain
or not.

The option before was a completely non-scalable joke of a fuzzy query, or nothing.
Now you have this option. Great. A little magic? Sure. Great :)

------
siculars
I have some interest in various edit distance implementations. The problem
with the basic edit distance is that you need to match each string against all
stored strings. This becomes an unbearable computational cost as the size of
your corpus increases.

It seems that what they have done here is create a rainbow table of sorts that
houses all the possible edits at distance 1 and 2. Distance 3 is possible but
requires more space to store and more time to scan.
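
One way to make that "precomputed edits" idea concrete is a deletion-
neighborhood index (in the spirit of the FastSS family of methods - this
sketch is illustrative, not what Lucene does): any two words within edit
distance 1 share a one-character-deletion variant, so indexing those variants
yields a fast candidate set, which a real system would then verify with an
actual distance check.

```python
from collections import defaultdict

def deletions(word):
    # The word itself plus every one-character-deletion variant.
    return {word[:i] + word[i+1:] for i in range(len(word))} | {word}

def build_index(dictionary):
    index = defaultdict(set)
    for w in dictionary:
        for v in deletions(w):
            index[v].add(w)
    return index

def lookup(index, query):
    # Candidate words sharing a deletion variant with the query.
    hits = set()
    for v in deletions(query):
        hits |= index.get(v, set())
    return hits

idx = build_index(["lucene", "lucent", "query"])
print(sorted(lookup(idx, "lucen")))  # ['lucene', 'lucent']
```

The space cost grows quickly with the allowed distance, which matches the
comment above about distance 3 being possible but expensive.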

This is a very interesting problem and has application in many, many areas. I
always felt like there was more work to be done here and it looks like there
may still be yet. For example, an area that edit distance was initially
applied to was in person deduplication. When merging lists of names it is
important to identify duplicates and merge them appropriately. This is a
problem for me in medical informatics and is more devious than it sounds on
first blush.

~~~
nkurz
A "rainbow table" works in a somewhat similar manner, but is based on doing a
full calculation then saving only certain starting points. I think what they
are doing is actually more like creating a regular expression that matches all
words a particular Levenshtein distance from the target.
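
That automaton view can be sketched directly: simulate NFA states of the form
(target characters consumed, errors used) against a candidate word. This is
only the naive NFA simulation - Lucene's version precomputes a parametric DFA
from the paper's tables - and all names here are illustrative:

```python
def within_distance(target, candidate, d):
    """True if the edit distance between target and candidate is <= d."""
    def closure(states):
        # Epsilon moves: skipping (deleting) target characters, one error each.
        out = set(states)
        for pos, errs in states:
            while errs < d and pos < len(target):
                pos, errs = pos + 1, errs + 1
                out.add((pos, errs))
        return out

    states = closure({(0, 0)})  # (target chars consumed, errors used)
    for ch in candidate:
        nxt = set()
        for pos, errs in states:
            if pos < len(target) and target[pos] == ch:
                nxt.add((pos + 1, errs))            # exact match
            if errs < d:
                nxt.add((pos, errs + 1))            # extra char in candidate
                if pos < len(target):
                    nxt.add((pos + 1, errs + 1))    # substitution
        states = closure(nxt)
        if not states:
            return False  # pruned: already past the error budget
    return any(pos == len(target) for pos, errs in states)

print(within_distance("lucene", "lucent", 1))  # True
```

The win over the brute-force scan is that the automaton for the query can be
intersected with the term dictionary, so non-matching words are rejected
without computing a full distance for each one.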

Here's more about how they are doing it:
<http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata>

This article has a very good overview of the options:
<http://stevehanov.ca/blog/index.php?id=114>

------
ballard
<http://www.slideshare.net/rboulton/comparing-open-source-search-engines>

<http://www.osnews.com/story/21782/Open_Source_Search_Engine_Benchmarks>

Also there's holumbus <http://holumbus.fh-wedel.de/>, an active OSS search
engine written in haskell.

------
ollysb
A few years back I worked on a project that had switched to Autonomy from
Lucene. They'd had problems getting good enough results from Lucene, so they
decided to plough some money into the problem and go for what was considered
the best solution at the time. My impression now is that Lucene has come a
very long way - does anyone know how they compare today?

------
quinndupont
That's a pretty obscene performance gain. Either that new algorithm was magic,
or the one before was pretty crappy. When else do you see this kind of
performance gain in production code?

~~~
lzm
As the article says, the previous one was a brute-force implementation. And
the 100x number is kind of nonsensical, since the real speedup depends on the
size of the input (the asymptotic complexity of the algorithm went from O(nmk)
to something like O(n+mk), I believe).

