This is not to pick on Lucene in particular - I'm not 100% sure it even applies - but I always have mixed feelings when massive speed improvements land in a project. They tend to validate earlier criticisms about the project's speed that were dismissed (and often countered) by community members.
"Java is slow."
"No, it's not - it's hella fast!"
"But it takes 4 minutes to launch every Java app I ever use."
"Dude, you're doing it wrong - everything I write and use in Java is so fast I have to sometimes wonder if Java didn't magically upgrade my CPU!"
etc.
Then a new JVM will come out with measurably faster performance, which validates the earlier view that something was indeed slow, but it's ignored/overlooked/glossed over by proponents.
The other response often is "the code's open, fix it yourself", which generally does no one any good. Just because I'm qualified to determine that something is slow doesn't mean I have the foggiest clue how to fix it. Nor am I encouraged that you'll actually take my patches back (even if I could write them) because you (project team) don't seem to think there's a speed problem in the first place.
This isn't to rag on Lucene's team - I've no idea if this applies to them in particular - it's just something that popped into my head when I read the title.
The article itself is interesting, describing how the speedup was implemented starting with Python - I won't spoil the rest of the story :)
Java is fast for long-running applications. Once the JIT kicks in, important code is natively compiled. But it does take a long time to start. This is why Java is big in the server space, and not so much for desktop applications.
The same holds true for Lucene. It works wonderfully well for most queries. These optimizations are strictly for fuzzy queries, which, at least in my deployment, no one ever used.
This is the absurdity of performance measurement. Everyone wants to quantify it into a single score. But doing so is like making a 100m track sprinter race a cross-country runner. Wtf are your criteria?
I was using it as an example - KDE could have been another, or MySQL, or whatever.
The criterion in the Java example - conversations I've had with people directly - was launching and running a desktop app. Things like clicking a menu option and being able to watch the menu draw (< 1 second, but still noticeable). I sat down with a couple of die-hard Java guys years ago and finally showed them the slowness I would complain about. On my machine - click X, watch Y take a long time, etc.
"Oh, that's fine! What are you complaining about?"
"Well, when I run a 'native' app, I don't notice any of these slow operations."
"Oh, it's fine - that's hardly noticeable at all! Why do you care? Java's fine!" and so on.
So... often judgements really are in the eye of the beholder, regardless of what numerical benchmarks show.
Also, for the record, I've not thought Lucene was slow for any use cases I've needed it for. I've seen benchmarks showing other tech faster in some cases, but never needed that performance.
"Java is fast for long-running applications. Once the JIT kicks in, important code is natively compiled. This is why Java is big in the server space, and not so much for desktop applications."
Java is a pain in the server space. The memory footprint is bloated, so cache locality tends to suck; it's prone to nasty garbage-collection pauses (hello, latency spikes!); and instead of one process that can crash or go funky (your code), now you've got two (the JVM and your code). And that's just the tip of the iceberg.
I honestly don't know why people started writing server software in Java, but I wish they would stop (I'm looking at you, Apache project). There's absolutely nothing that Java is "good" at that couldn't be done better in other languages...except for allowing teams of mediocre developers to write frameworks for big software bureaucracies.
Lucene has basically always been really fast for just-plain-indexing-and-search. Certain additional features are pretty expensive to do the obvious way and came in as "it works" first and "it's fast" later. Reading the blog entry, this seems to be the case for fuzzy searches. One thing to take note of is that you can't just change your index structure for one feature; that might make other features slower and not be worth it.
One thing I'd forgotten to mention, which is probably applicable in this case, is that 'non-standard' approaches may in fact not be computationally feasible when a project first starts out. Precomputing loads of tables of data makes sense when you're on 3 GHz processors, but probably not if you're on 200 MHz CPUs, for example.
Yeah, I remember when a new version of Android came out that was 4x faster and I felt like I was the only one who interpreted that to mean that most of the CPU cycles were completely wasted by the previous versions. The same thing applies to OS X, where every version has been faster on the same hardware.
People talk about software performance, but another way to look at things is that hardware provides performance and software (especially OS/middleware) only provides overhead. So software doesn't really get faster, just less wasteful.
After considering Lucene, I built my own in-house search engine. I wanted it to work a lot more like Google than a "full-text" library-style search engine.
Very few users will go beyond the basics. How many users actually use Google's advanced search? Even programmers? Very few. Users don't use advanced search features.
Why? It's not their fault. They tried. They tried at the library - it didn't work. They tried on the early search engines - it didn't work. They tried on your intranet app powered by database full-text search - it didn't work.
We trained them. We showed them that (most) advanced searching isn't worth their time.
What does work? Revision + speed. Make it easy to try different combos of search terms really fast. Correct spelling, suggest searches, add the ability to filter information.
See: Information Architecture for the World Wide Web p. 185.
If you're using Lucene, it's not too difficult to write your own query parser that does exactly what you want, including fuzzy searches, spelling correction, term weighting, etc. My current employer has a rather involved one that can take advantage of lots of application specific data.
Writing a custom query parser allows you to do as much hidden magic on the query as you want, and expose your own special operators if you really want to. Check out some of our user accessible operators at http://help.greplin.com/customer/portal/articles/15527-how-c...
Incidentally, I'd love to exchange notes and hear about your search engine sometime. If you're interested, shoot me an email (address in profile).
"After considering Lucene, I built my own in-house search engine."
Yeah, totally makes sense to write your own search engine instead.
Or you could use Lucene and have a tiny bit of code in front of it that tacks on "~" to query terms first and save yourself some time. Or spend a few minutes reading the Lucene docs and find out how to do what you'd like to do. I've had the misfortune of dealing with homegrown search engines... it was laughable to see all that work (requiring a bunch of heavy-duty servers, lots of development time) replaced with a Lucene index that took half a day to get up and running with much better performance and features on a tiny VPS.
A general idea is to 'compile' user queries into a Lucene query with the various default options that give the best search, and not to use raw Lucene queries from the user. The knobs need to be there for turning; whether you expose those knobs to the end user is a design decision.
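To make that concrete, here's a minimal sketch of such a query 'compiler' (assuming a Lucene 5+ style API; the class names have shifted across versions, so treat it as illustrative): it tokenizes the raw user input and ORs together one fuzzy clause per term, so the user gets fuzzy matching without ever typing a "~".

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.FuzzyQuery;
    import org.apache.lucene.search.Query;

    public class SimpleQueryCompiler {
        // Compile a raw user string into a Lucene query with our own defaults:
        // every whitespace-separated term becomes an optional (OR) fuzzy clause.
        public static Query compile(String field, String userInput) {
            BooleanQuery.Builder builder = new BooleanQuery.Builder();
            for (String token : userInput.trim().toLowerCase().split("\\s+")) {
                // maxEdits = 2 tolerates up to two typos per term.
                builder.add(new FuzzyQuery(new Term(field, token), 2),
                            BooleanClause.Occur.SHOULD);
            }
            return builder.build();
        }
    }

In a real deployment you'd run the input through the same analyzer used at index time rather than a naive split, but the shape of the idea is the same.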
We recently chose a Solr/Lucene solution for our game search at Big Fish Games (launches tomorrow!). I cannot imagine writing from scratch many features that come with Solr/Lucene: spelling correction, word stemming (walk == walking), stop words (don't return every hit for "the"), sane defaults for tokenizing (splitting up sentences into indexable and searchable chunks, like words), and uhh soon Fuzzy Query matching.
It's certainly easy to use a very limited subset of Lucene's capabilities to come up with a very intuitive user-searchable index of data.
If you wrote something from scratch that truly works better than an out of the box Solr server, let's just say I'd be surprised.
If you can't imagine writing a spelling corrector from scratch, you might want to take a minute and broaden your horizons (I know it was magic pixie dust for me before I read it): http://norvig.com/spell-correct.html
Stemming could be accomplished just as easily: put in a word, get back a list of stems. Just a dictionary lookup that could be precomputed.
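For the curious, here's a rough Java transcription of the heart of Norvig's approach (transpositions omitted, and all names here are hypothetical): generate every string within one edit of the input and keep the ones that appear in a known-words dictionary. A real corrector would also rank candidates by corpus frequency, and the stemming case really is just a precomputed map lookup.

    import java.util.HashSet;
    import java.util.Set;

    public class ToySpell {
        static final String LETTERS = "abcdefghijklmnopqrstuvwxyz";

        // All strings within edit distance 1: deletions, substitutions, insertions.
        static Set<String> edits1(String w) {
            Set<String> out = new HashSet<>();
            for (int i = 0; i < w.length(); i++)
                out.add(w.substring(0, i) + w.substring(i + 1));         // delete
            for (int i = 0; i < w.length(); i++)
                for (char c : LETTERS.toCharArray())
                    out.add(w.substring(0, i) + c + w.substring(i + 1)); // substitute
            for (int i = 0; i <= w.length(); i++)
                for (char c : LETTERS.toCharArray())
                    out.add(w.substring(0, i) + c + w.substring(i));     // insert
            return out;
        }

        // Keep only the candidates that are real words.
        static Set<String> candidates(String w, Set<String> dictionary) {
            Set<String> hits = new HashSet<>();
            for (String e : edits1(w))
                if (dictionary.contains(e)) hits.add(e);
            return hits;
        }
    }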
I thought his comment was more along the lines of "I cannot imagine writing industrial-strength versions of all of these things Lucene gives me."
Yes, you can write a spellchecker in 21 lines of code, but that doesn't necessarily mean it will be fast enough to be a component in website search, or that it will be any kind of a pleasure to query or maintain the word corpus.
I can put together toy versions of many things Lucene provides pretty easily in my own time. Building useful, dependable versions of most anything takes a nontrivial amount of time and effort, so it's smart to restrict my usage of toys to understanding the concepts.
I don't want to sound like a troll, but Lucene's built-in spellchecker is a toy. If you're doing much more than trying to just plonk in a search engine on your website, you'll probably find it insufficient.
As far as stemming goes, AFAIK, Lucene didn't invent anything. It's a lot like crypto, in that nearly everyone uses public implementations of stemmers for different languages: http://snowball.tartarus.org/
The previous two points both apply to stop words (you'll almost certainly want to customize your own list, and there are good, curated lists out there for the most common languages), and Lucene's tokenizer options aren't really that smart -- if you want to get into chunking, phrase identification, etc., you're going to have to write it yourself. But then you're going to be working within the Lucene world, tied to their index format, mucking around in the depths of unfamiliar code, etc. That's a big trade-off.
Basically, Lucene is great if you're working on a website and you need a drop-in solution for search, but if your core user experience is search, it has a lot of deficiencies.
Are you saying you wrote an alternative to Lucene because users don't like to type in search queries that Lucene understands? Lucene is a low-level library for implementing a search experience. If you don't understand that then I don't think you understand how a sophisticated search engine like Google works.
Say you type in:
"Lucene equivalent in C++ or C#"
This query is not passed as-is to some magic implementation of a Search Engine by Google.
First they need to parse this query. This involves a tokenizer step, which will have many rules depending on the context, the language requirements, etc. An industrial-strength tokenizer is a significant undertaking in itself.
The next step is to turn this user query into something that a search engine understands. Unless your default mode of operation is "match this user query EXACTLY", that means having some kind of query language. For any search engine, this will end up looking like the syntax you have a problem with in your post.
Next, to have any kind of useful and competitive search experience, you will need to take the user query and generate tens, maybe even hundreds or thousands, of different sub-queries that try to explore different estimates of user intent and fuzzy matching. For example, you might do some analysis and realise that "equivalent" in this context is not very important and decide to generate another twenty queries with all known synonyms for that word. You might decide to try the query with certain words dropped altogether.
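As a toy illustration of that sub-query expansion step (the synonym table and names here are entirely made up), generating variants for a low-importance term might look like:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    public class QueryExpander {
        // Toy synonym table; a real system derives this from data.
        static final Map<String, List<String>> SYNONYMS = Map.of(
            "equivalent", List.of("alternative", "replacement", "counterpart"));

        // Variants: the original query, one with the weak term dropped,
        // and one per synonym substitution.
        static List<String> expand(List<String> terms, String weakTerm) {
            List<String> variants = new ArrayList<>();
            variants.add(String.join(" ", terms));
            List<String> dropped = new ArrayList<>(terms);
            dropped.remove(weakTerm);
            variants.add(String.join(" ", dropped));
            for (String syn : SYNONYMS.getOrDefault(weakTerm, List.of())) {
                List<String> swapped = new ArrayList<>(terms);
                swapped.replaceAll(t -> t.equals(weakTerm) ? syn : t);
                variants.add(String.join(" ", swapped));
            }
            return variants;
        }
    }

Each variant then gets compiled, executed, and its results merged and re-scored, which is where the ranking problem comes in.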
And of course, you also need to be able to rank results. Not just within sub-queries but when merging all the results from sub-queries together. This takes sophisticated and highly customised scoring algorithms that will be dependent on context and a whole lot of other things. Essentially you need to guess what is the best match based on what you think the user wants.
Finally, to be an industry leader like Google you will use a data-driven approach to scoring that makes use of machine-learning techniques to keep your advantage more than a purely technology one.
For companies that just want a basic search that will behave intuitively for people used to the Google model, Lucene helps with A LOT of this. Their default query parser, scorer and other vital components are good enough that most websites can just use the default configurations out of the box and get a good search experience.
All that you're left with in the end is to create the basic user interface abstraction - the "one box" experience. Solr helps enormously with this, but in the end all the functionality it exposes is Lucene.
For more complicated search experiences, there isn't much that I described in the above approach that you can't do in Lucene.
If you are a PHP dev, you don't need to bother wasting your time writing a wrapper on top of Lucene. There's a very nice implementation called ElasticSearch which already has most of the functionality you'd be seeking.
Can you tell us more about your search engine? I'm building an online store and search is going to be crucial for us. Currently, I'm evaluating Lucene (Solr).
On the one hand, this does not inspire confidence. It is disturbing to have magic in one's software. On the other hand, the speed gains are really impressive: I'm sure there are times when it is reasonable to make a magic vs. utility tradeoff. Also, it's OSS: I'm sure someone will eventually want to make a well-understood and documented version, and the devs seem like people who would be willing to accept that.
I completely agree. There is, of course, always a balance to be struck, but it's pretty difficult to ignore a 100x speedup in something non-trivial.
Also, I read part of the paper linked in the post. It doesn't seem completely inaccessible, just dense. I'm sure someone will eventually come up with a non-hacky version.
Lucene's previous approach to fuzzy matching was to check the distance between the input word and every word in the dictionary. Let's assume that their implementation to compute the distance between two words was linear-time in the size of the largest word being compared. Then the complexity of this method is O(nm), where n is the average length of all words and m is the number of words in the dictionary.
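In code, that brute-force approach is essentially this sketch (note the textbook dynamic-programming distance used here is O(len1*len2) per pair, a bit worse than the linear-time assumption above):

    import java.util.ArrayList;
    import java.util.List;

    public class BruteForceFuzzy {
        // Classic DP Levenshtein distance between two words.
        static int distance(String a, String b) {
            int[] prev = new int[b.length() + 1];
            int[] curr = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) prev[j] = j;
            for (int i = 1; i <= a.length(); i++) {
                curr[0] = i;
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                       prev[j - 1] + cost);
                }
                int[] tmp = prev; prev = curr; curr = tmp;
            }
            return prev[b.length()];
        }

        // The old approach: test the query against EVERY term in the dictionary.
        static List<String> fuzzyMatches(String query, List<String> terms, int maxDist) {
            List<String> hits = new ArrayList<>();
            for (String t : terms)
                if (distance(query, t) <= maxDist) hits.add(t);
            return hits;
        }
    }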
The new algorithm uses some precomputed tables that define a parametrized deterministic finite state machine. When the user inputs a word, the word is used to fix the parameters (that is, to set the rules on how to traverse the states defined in the precomputed tables).
From the given information, it is unclear to me what the complexity of this parameter-fixing step is, but the paper states that it's linear, so we'll say O(m). Then, discovering all the words in the dictionary that are within a fixed distance d has a complexity of O(k+m), where k is the length of the longest word in the dictionary.
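For intuition about what such an automaton does, here is a sketch that simulates the Levenshtein NFA directly for one query word; this is a simpler, slower technique than the paper's precomputed parametric DFA tables, not the paper's method itself:

    import java.util.HashSet;
    import java.util.Set;

    public class LevAutomaton {
        // A state (i, e) means: matched the first i chars of 'word' using e edits.
        // Packed into a long as (i << 32) | e.
        private final String word;
        private final int maxDist;

        LevAutomaton(String word, int maxDist) {
            this.word = word;
            this.maxDist = maxDist;
        }

        // Is 'candidate' within maxDist edits of 'word'?
        boolean accepts(String candidate) {
            Set<Long> states = closure(initial());
            for (char c : candidate.toCharArray()) {
                Set<Long> next = new HashSet<>();
                for (long s : states) {
                    int i = (int) (s >>> 32), e = (int) s;
                    if (i < word.length() && word.charAt(i) == c)
                        next.add(pack(i + 1, e));         // exact match
                    if (e < maxDist) {
                        next.add(pack(i, e + 1));         // extra char in candidate
                        if (i < word.length())
                            next.add(pack(i + 1, e + 1)); // substitution
                    }
                }
                states = closure(next);
                if (states.isEmpty()) return false;       // dead end: prune early
            }
            for (long s : states) {                       // accept if the unmatched
                int i = (int) (s >>> 32), e = (int) s;    // tail of 'word' fits in
                if (word.length() - i <= maxDist - e)     // the remaining edit budget
                    return true;
            }
            return false;
        }

        // Epsilon moves: skipping a char of 'word' consumes no candidate input.
        private Set<Long> closure(Set<Long> states) {
            Set<Long> out = new HashSet<>(states);
            boolean grew = true;
            while (grew) {
                grew = false;
                for (long s : new HashSet<>(out)) {
                    int i = (int) (s >>> 32), e = (int) s;
                    if (e < maxDist && i < word.length())
                        grew |= out.add(pack(i + 1, e + 1));
                }
            }
            return out;
        }

        private Set<Long> initial() {
            Set<Long> s = new HashSet<>();
            s.add(pack(0, 0));
            return s;
        }

        private static long pack(int i, int e) { return ((long) i << 32) | e; }
    }

The paper's contribution is, roughly, doing the equivalent of this once, symbolically, so that at query time you only index into precomputed transition tables instead of juggling state sets.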
Complexity-wise, it is very clear that you can gain an X-fold increase in performance by moving from one algorithm to the other, simply by growing the dictionary until you get to your desired X.
Finally, I really dislike the author stating that "Unfortunately, the paper was nearly unintelligible!". I've taken a quick look at it, and it is immediately clear that the authors put a great deal of effort into it. Further, it looks quite packed, but perfectly organized and written in a clear-enough language...
"Finally, I really dislike the author stating that 'Unfortunately, the paper was nearly unintelligible!'. I've taken a quick look at it, and it is immediately clear that the authors put a great deal of effort into it. Further, it looks quite packed, but perfectly organized and written in a clear-enough language..."
It does not matter how much effort is put in or how clearly a paper is written; to someone without the proper background it will always sound unintelligible. Put yourself in the position of a person who never studied formal language theory and try to read it. It is utterly impossible. I do not say that it is a bad thing (had researchers needed to put everything in layman's terms, they would not have much time left for actual research); it is just understandable.
Yeah, I totally understand that the paper must be very hard to grasp for someone without a minimal background in formal languages.
However, the author said that it is unintelligible, implying that nobody can understand it, and then supports this idea by elevating the guy who actually understood and implemented it in his spare time to wizard status.
Discrediting someone's work that much is plain wrong as I see it, even more so if you are using it to claim "100x speedups" in your project.
I'm torn between being happy that Lucene is getting a giant speed boost and my concern that the code is doing something magical which the programmers don't understand. If there's a bug and it has to do with the algorithm, how will they fix it?
I think I came away with almost the opposite impression from reading the article.
From my perspective, they did not implement it. Mark Miller and Robert Muir were able to implement the algorithm for the N=1 case, but were stuck until they found the existing Moman code. They did not implement their own code using Moman as a reference implementation, but just used Moman's code to generate the required tables.
From the article, "Not really understanding the Python code, and also neither the paper, we desperately tried to write our own Python code to tap into the various functions embedded in Moman's code". This sounds to me like they did not have a good understanding of the algorithms they were trying to implement.
They did do a fair bit of testing and did uncover a bug in the Moman code base, but, again, they did not fix this bug themselves; they appealed to Jean-Philippe, who then quickly fixed his code -- in effect, they were relying on a third party.
And, yes, they did apply the end result to make fuzzy searching a lot faster, which is a good and practical end. It took a lot of effort on the Lucene team's part to get this feature implemented, but that does not mean that anyone has a good understanding of the end result.
In short, I don't get the impression that anyone on the Lucene team could give a 1 hour talk on implementing the Klaus Schulz and Stoyan Mihov paper to a formal language and automata audience.
I'm amazed by the text in that post. In general I appreciate people admitting when they don't understand something, but the tone there goes beyond relaxed to, well, a celebration of ignorance (Greek letters! Oh noes!). Is it a joke?
It's also odd that they don't mention whether they even contacted the authors of the paper to see if an implementation (even a partial one) had been made, or whether clarification could be provided on implementation details.
A lot of commenters here are scared of using "magic" code.
However, note that search/information retrieval is a hard problem. Unlike many other problems, developing a generalized full-text search engine is genuinely difficult.
Testing search algorithms is even more difficult; e.g., NIST organizes the TREC conference http://trec.nist.gov/pubs/call2011.html in which a major emphasis is on the evaluation of search algorithms.
In fact, search is what my advisor calls an AI-complete problem, i.e., creating a perfect search engine would amount to creating a human-like artificial intelligence capable of understanding your query and the corpus.
Heh - Mike was exaggerating when he said unintelligible. While we are not masters of that paper, we worked through it and understood the algorithm. We could take a simple example and apply the steps - a very different understanding from that of someone who studies and focuses on this field, yes. Given time, we could have done the implementation without the Python code - Mike's recollection of the early part of this story is heavily secondhand.
However, even understanding the algorithm (if not masters of all the concepts behind it), there was still a large gap to implementation. The solution that was used allowed us to focus on pieces of that problem and accelerate development fantastically.
Lucene has some of the best tests in open source software IMO. We are confident in this code - whether it takes 2 or 3 people to properly maintain or not.
The option before was a completely non-scalable joke of a fuzzy query, or nothing. Now you have this option. Great. A little magic? Sure. Great :)
I have some interest in various edit distance implementations. The problem with the basic edit distance is that you need to match each string against all stored strings. This becomes an unbearable computational cost as the size of your corpus increases.
It seems that what they have done here is create a rainbow table of sorts that houses all the possible edits at distance 1 and 2. Distance 3 is possible, but requires more space to store and more time to scan.
This is a very interesting problem and has application in many, many areas. I always felt like there was more work to be done here, and it looks like there may still be yet. For example, an area that edit distance was initially applied to was person deduplication. When merging lists of names it is important to identify duplicates and merge them appropriately. This is a problem for me in medical informatics and is more devious than it sounds at first blush.
A "rainbow table" works in a somewhat similar manner, but is based on doing a full calculation then saving only certain starting points. I think what they are doing is actually more like creating a regular expression that matches all words a particular Levenshtein distance from the target.
A few years back I worked on a project that had switched to Autonomy from Lucene. They'd had problems getting good enough results from Lucene, so decided to plough some money into the problem and go for what was considered the best solution at the time. My impression now is that Lucene has come a very long way. Does anyone know how they compare today?
That's a pretty obscene performance gain. Either that new algorithm was magic, or the one before was pretty crappy. When else do you see this kind of performance gain in production code?
As the article says, the previous one was a brute-force implementation. And the 100x number is kind of nonsensical, since the real speedup depends on the size of the input (the asymptotic complexity of the algorithm went from O(nmk) to something like O(n+mk), I believe).
"Java is slow." "No, it's not - it's hella fast!" "But it takes 4 minutes to launch every Java app I ever use." "Dude, you're doing it wrong - everything I write and use in Java is so fast I have to sometimes wonder if Java didn't magically upgrade my CPU!" etc.
Then a new JVM will come out with measurably faster performance, which validates the earlier view that something was indeed slow, but it's ignored/overlooked/glossedover by proponents.
The other response often is "the code's open, fix it yourself", which generally does no one any good. Just because I'm qualified to determine that something is slow doesn't mean I have the foggiest clue how to fix it. Nor am I encouraged that you'll actually take my patches back (even if I could write them) because you (project team) don't seem to think there's a speed problem in the first place.
This isn't to rag on Lucene's team - I've no idea if this applies to them in particular - it's just something that popped in to my head when I read the title.
The article itself is interesting, describing how the speedup was implemented starting with Python - I won't spoil the rest of the story :)