
Lucene: The Good Parts - SkyRocknRoll
http://blog.parsely.com/post/1691/lucene/
======
latchkey
Here is a bit of interesting history that I bet few people know about.

Doug Cutting worked on VTwin at Apple long before he wrote Lucene. VTwin was
the codename for the search technology Apple was building into the OS. I knew
about it because at the time I was building computer labs for my college and
bought a lot of Apple hardware (about a million dollars' worth), and my rep
added me to the private beta.

Many years later, when I saw the Lucene project being worked on and saw that
Doug was behind it, I immediately reached out to him and asked him if he was
interested in joining the Jakarta group, as I thought this would be a great
addition to our growing community.

Needless to say, the rest is history, but I feel like if nobody had reached
out to Doug, he may not have gotten as much exposure and may not have been
motivated to start the rest of the amazing projects that he's worked on,
including Hadoop.

=)

[1]
[https://www.google.com/search?q=apple+vtwin+codename](https://www.google.com/search?q=apple+vtwin+codename)

------
derefr
> SQL was not then, and is still not now, a very good blob or document storage
> system.

What does that even _mean_? SQL is a wire protocol for relational queries.
Document/blob/key-value stores can be made to speak SQL. Most of them just
choose not to implement SQL support for some weird reason.

~~~
pixelmonkey
(Author here.) As others have pointed out, I am here referring to SQL as a
broad term for "SQL RDBMSes". When I say that it is not a good blob or
document storage system, what I mean is this: You can plop unstructured data
into a SQL RDBMS using something like JSON serialization, but it's not
generally a good idea. Document storage requires flexible schemas, so SQL
schemas are also not generally a great idea -- you end up with tables with
lots of nullable fields that are usually empty.

As an example, imagine a single database where you want to store actual
desktop documents, such as the formats supported by Apache Tika:
[https://tika.apache.org/1.8/formats.html](https://tika.apache.org/1.8/formats.html)
-- If you try to model this using a SQL schema, you'll likely be in for a
world of pain. From a UX standpoint, a user just wants to "search across all
documents", but you have hundreds of heterogeneous types with varying degrees
of field-level compatibility.
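
To make the nullable-columns point concrete, here is a minimal sketch using
Python's built-in sqlite3 module. The table, its columns, and the three sample
rows are all made up for illustration (they are not from the article), and a
real mix of document types would have far more type-specific fields than this.

```python
import sqlite3

# Hypothetical wide table that tries to cover several document types at once.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE documents (
        id INTEGER PRIMARY KEY,
        doc_type TEXT NOT NULL,
        title TEXT,
        author TEXT,
        page_count INTEGER,
        sheet_count INTEGER,
        duration_sec REAL,
        body TEXT
    )
    """
)

# Each document type fills in only the columns that make sense for it.
rows = [
    (1, "pdf", "Annual report", "A. Writer", 42, None, None, "revenue grew"),
    (2, "spreadsheet", "Budget", None, None, 3, None, "q1 q2 q3"),
    (3, "audio", "Interview", None, None, None, 1800.0, None),
]
conn.executemany("INSERT INTO documents VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)

# Columns that only apply to some document types (an assumption of this sketch).
TYPE_SPECIFIC = ["author", "page_count", "sheet_count", "duration_sec"]


def null_fraction(connection):
    """Return the fraction of type-specific cells that are NULL."""
    n_rows = connection.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
    nulls = 0
    for col in TYPE_SPECIFIC:
        # Column names come from the fixed list above, not from user input.
        query = f"SELECT COUNT(*) FROM documents WHERE {col} IS NULL"
        nulls += connection.execute(query).fetchone()[0]
    return nulls / (n_rows * len(TYPE_SPECIFIC))


print(f"{null_fraction(conn):.0%} of type-specific cells are NULL")
```

Even with three types and four type-specific columns, most of those cells are
empty; every additional type adds more columns that the other types never use.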

~~~
bunderbunder
What do you mean by "not generally a good idea"? In the past I've had little
cause to complain about using XML fields to store and search semi-structured
data in MS SQL Server.

I have read that it doesn't scale out as nicely as document stores do, and
knowing SQL Server I don't have trouble believing that. Personally I'm a very
long way away from needing to worry about scale out[1] in the applications I
use a DBMS for, though, so that's never really kept me up at night.

Word I've heard on the street is that the story's similar for PostgreSQL.

[1]: [http://yourdatafitsinram.com](http://yourdatafitsinram.com)

------
pixelmonkey
I was just informed today that this article, "Lucene: The Good Parts", which I
wrote a few months ago, was also just published in Hacker Monthly for this
month's print/digital issue (June 2015):

[http://hackermonthly.com/issue-61.html](http://hackermonthly.com/issue-61.html)

------
ddorian43
I still don't understand why Lucene can't use the original document for
aggregations. I understand that it will be slower if the field isn't indexed,
but it should be doable, like RDBMSes do.

~~~
boomzilla
It can, as long as you store all the fields that will be needed for
aggregation. Stored, non-indexed field retrieval can be slow, as it might
involve a lot of random seeks, so it might be slower than relational DBs.

What use case do you have in mind?

~~~
ddorian43
I mean in the case of the _source field that Elasticsearch uses. But it seems
that is separate. My use case is that I want to do querying + filtering on all
fields of the document. In this case you have to:

1. store the field or document so you can get back the value (e.g. when
querying the document)

2. index the field for filtering

3. separately index the field for aggregation (doc_values)

Whereas in an RDBMS you only need (1).
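
A toy sketch of those three structures in Python, in case it helps. The class
and method names are hypothetical, and this is not how Lucene or Elasticsearch
actually lay things out on disk; it only shows that each of the three steps
maintains a separate structure over the same data.

```python
from collections import defaultdict


class ToyIndex:
    """Illustration only: three separate structures built from one document."""

    def __init__(self):
        self.stored = {}                     # (1) doc id -> original document
        self.inverted = defaultdict(set)     # (2) (field, value) -> doc ids
        self.doc_values = defaultdict(dict)  # (3) field -> {doc id -> value}

    def add(self, doc_id, doc):
        self.stored[doc_id] = doc
        for field, value in doc.items():
            self.inverted[(field, value)].add(doc_id)
            self.doc_values[field][doc_id] = value

    def filter(self, field, value):
        """Find matching doc ids with one lookup in the inverted index."""
        return set(self.inverted.get((field, value), set()))

    def sum_field(self, field, doc_ids):
        """Aggregate by reading a single per-field column."""
        column = self.doc_values.get(field, {})
        return sum(column[d] for d in doc_ids if d in column)

    def sum_field_from_stored(self, field, doc_ids):
        """Same aggregation, but touching every whole stored document."""
        return sum(
            self.stored[d][field] for d in doc_ids if field in self.stored[d]
        )


index = ToyIndex()
index.add(1, {"section": "tech", "views": 10})
index.add(2, {"section": "tech", "views": 5})
index.add(3, {"section": "news", "views": 7})

hits = index.filter("section", "tech")
print(index.sum_field("views", hits))
print(index.sum_field_from_stored("views", hits))
```

Both sums give the same answer. The difference is that the stored-document
version has to load each full document to read one field, which is the access
pattern the parent comment describes as potentially involving a lot of random
seeks.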

------
prasanthv
Great article! Never even knew about Lucene.

------
bracewel
_sigh_ HTTPS mixed content... really?

