

Inaugural release of Apache Lucy, Version 0.1.0 - mthomas
http://mail-archives.apache.org/mod_mbox/incubator-general/201106.mbox/%3C7C5E177D-DA13-4B41-866A-2279D9F17BE1@jpl.nasa.gov%3E

======
endgame
When announcing things with cute names, can people please put a short
description in the link?

For everyone else: Lucy's apparently a full-text search library written in C
targeting dynamic languages, with Perl bindings to start with.

------
chuhnk
The announcement says: Lucy is a "loose C" port of Apache Lucene, a search
engine library for Java -- it is similar in purpose to Lucene, but designed to
take advantage of C's unique capabilities.

I'm wondering what these unique capabilities are. Speed? Smaller memory
footprint? And I wonder what the reason is behind doing this. I'm all for C
projects, but I'm very curious as to why, when Lucene is already very well done.

~~~
nkurz
I'm one of the developers, although not currently very active. The main
"unique capability" is closer integration to the machine. Our approach has
been "The OS is our VM".

We use mmap() heavily, and when running on 64-bit systems take liberal
advantage of the giant address space. Using the system to do more of the
buffering also allows us to have lightweight processes that can start quickly.
We think there will be both speed and memory advantages in the long run. The
other main difference is that C is much easier to integrate with other
languages than Java. We're starting out with Perl bindings, but have plans for
Ruby, Python, Lua, Tcl, and others. The goal is to offer a truly native
interface from the language of your choice.
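
To make the mmap() point concrete, here is a minimal sketch in Python rather than C. It is not Lucy's code; the helper names `find_offsets` and `demo` and the sample file contents are made up for illustration. The idea it shows is only the general one described above: map a file read-only and let the OS page cache do the buffering instead of reading into your own buffers.

```python
import mmap
import os
import tempfile


def find_offsets(path, needle):
    """Return the byte offset of every occurrence of `needle` in the file.

    Hypothetical helper: the file is mapped read-only, so pages are
    faulted in lazily by the OS and shared with any other process that
    maps the same file.
    """
    offsets = []
    with open(path, "rb") as f:
        # Length 0 means "map the whole file"; the file must be non-empty.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            pos = m.find(needle)
            while pos != -1:
                offsets.append(pos)
                pos = m.find(needle, pos + 1)
    return offsets


def demo():
    """Write a tiny sample file, search it through the mapping, clean up."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"lucy lucene lucy")
        path = tmp.name
    try:
        return find_offsets(path, b"lucy")
    finally:
        os.remove(path)
```

Because nothing is copied into process-private buffers up front, a process like this starts quickly, and several of them can share the same cached pages.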

The degree of host language integration is wild. You'll be able to seamlessly
subclass just about any part of the C library in any supported language.
Nothing is ready beyond C and Perl, but eventually you'll be able to have your
indexer in one language, and your customized searchers in a couple more, while
all sharing the same system cache.

As for why? Marvin, the main developer, started the project as KinoSearch at a
time when Lucene wasn't really ready for prime time. He's been very interested
in real-time indexing, and at the time Lucene didn't handle this well. I got
interested because I was looking for something lighter weight than Lucene,
where I could try to blend the boundaries between search and database
retrieval. Lucene had too many layers of abstraction for my purposes. A
parallel might be SQLite and Postgres. Both have their place, but Lucy is more
on the SQLite side of things.

------
ojosilva
Here's the Perl binding library in CPAN, in its simplest form:
<http://search.cpan.org/perldoc?Lucy::Simple>

The synopsis is quite illuminating. I just installed it with cpanm and within
10 minutes had a program that indexes and searches a collection of files with
highlighting. Looks promising!

------
z92
How much better is it compared to CLucene? CLucene got stuck at 1.9 and now
shows very little activity, while Java Lucene is rolling towards version 4.
Still, CLucene was less of a memory hog and faster than Java Lucene while it
was active. They claim it was 2.5 times faster.

If Lucy can deliver the latest advances in Java Lucene as a usable C
library, that would be very good news for me. Lucene is still the best
choice for large data indexing and searching solutions.

~~~
nkurz
CLucene aimed for binary compatibility with Lucene indexes, and as a result
had very little room to innovate. Lucy started out with the same approach, but
decided early on that it was better to take the best parts of Lucene's
internals while not being bound to all of them.

Some parts will be leading Lucy, and some will be catching up. There's already
increasing cross-pollination between the two. It's a very loose port at this
point.

------
rbrown46
I tried to get into Lucene (using Solr) recently but was put off by its
complexity for what was, in my case, a simple use case (quickly searching
through a large document set of html, txt, and doc files using proximity
search).

After futzing for hours with XSLT and writing scripts to submit content via
the REST API, I found out about FTS4 in SQLite, and was impressed by its
relative simplicity. I had something working in under an hour in Python.
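
For anyone curious what that looks like, here is a rough sketch of the general approach, not the actual script. It assumes your Python's SQLite build has FTS4 enabled, and the table name `docs`, the column names, and the helpers `build_index` and `proximity_search` are all made up for illustration. Proximity search uses FTS4's `NEAR/n` operator.

```python
import sqlite3


def build_index(docs):
    """Create an in-memory FTS4 table and load (path, body) pairs into it.

    Hypothetical schema: one row per file, with the file's text in `body`.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE docs USING fts4(path, body)")
    conn.executemany("INSERT INTO docs (path, body) VALUES (?, ?)", docs)
    return conn


def proximity_search(conn, a, b, distance=5):
    """Return paths whose body has terms `a` and `b` near each other.

    `NEAR/n` matches when at most n tokens separate the two terms.
    """
    query = f"{a} NEAR/{distance} {b}"
    rows = conn.execute(
        "SELECT path FROM docs WHERE body MATCH ? ORDER BY path", (query,)
    )
    return [row[0] for row in rows]


conn = build_index([
    ("a.txt", "full text search library written in C"),
    ("b.txt", "search is one thing and a very different thing is a library"),
])
hits = proximity_search(conn, "search", "library", distance=2)
```

Extracting text from html and doc files before inserting it is a separate step that this sketch leaves out.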

~~~
fizx
Wait, what!? You have at least two good options for Solr libraries in Python,
neither of which brings you anywhere close to XSLT.

- <http://haystacksearch.org/>
- <http://code.google.com/p/pysolr/>

~~~
nzadrozny
Also, Sunburnt: <https://github.com/tow/sunburnt/>

------
toisanji
Why would Apache incubate a competing product like this? And what exactly are
the unique capabilities that this project can take advantage of? Lucene is
already extremely easy to interface to since it's just a REST interface.

~~~
pjscott
Solr provides a REST interface. Lucene is a Java library.

------
powertower
clicky <http://incubator.apache.org/lucy/>

