This implementation uses set semantics with "and" as the only Boolean operator (the Boolean retrieval model), the dominant paradigm of 1960s library catalog search systems.
The Lucene library and the engines built on top of it (Elasticsearch, Solr) use ranked retrieval: results are scored and returned in order from most to least relevant. For example, the Vector Space Model (VSM) computes the cosine between the query vector (the query's term frequencies, all other components zero) and each document vector (the document's term frequencies, all other components zero). There are other model families too, such as the Probabilistic Model (better known as BM25 or Okapi, after the version of the ranking formula used and the name of the software system that implemented it, respectively) or DFR (Divergence from Randomness), another Information Retrieval (IR) model type.
Since 1972, terms have been weighted with the popular TF-IDF scheme (Term Frequency times Inverse Document Frequency; Spärck Jones (1972), J. Doc.), usually combined with document length normalization. I'd argue these are important ingredients even in a minimal implementation (and the first Lucene version actually had them).
Without ranked retrieval there is no guarantee that the most relevant result in a list of 10,000 documents isn't returned last. (A toy sketch of the scoring follows below.)
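To make that concrete, here's a toy sketch of TF-IDF weighted cosine scoring over an in-memory collection, in Go to match the post. It's my own illustration, not Lucene's actual formula; I use a smoothed IDF variant so weights stay positive, and I skip dividing by the query vector's norm since that's constant per query and doesn't affect the ranking:

    package main

    import (
        "fmt"
        "math"
        "sort"
        "strings"
    )

    func main() {
        docs := []string{
            "the quick brown fox",
            "the lazy dog",
            "the quick dog jumps over the lazy dog",
        }

        // Term frequencies per document, document frequency per term.
        tf := make([]map[string]float64, len(docs))
        df := map[string]float64{}
        for i, d := range docs {
            tf[i] = map[string]float64{}
            for _, t := range strings.Fields(d) {
                tf[i][t]++
            }
            for t := range tf[i] {
                df[t]++
            }
        }
        n := float64(len(docs))
        // Smoothed IDF: rare terms weigh more than ubiquitous ones.
        idf := func(t string) float64 { return math.Log(1 + n/df[t]) }

        query := []string{"quick", "dog"}

        type hit struct {
            doc   int
            score float64
        }
        var hits []hit
        for i := range docs {
            var dot, norm float64
            for t, f := range tf[i] {
                w := f * idf(t)
                norm += w * w // squared length of the document vector
            }
            for _, t := range query {
                if df[t] == 0 {
                    continue // term absent from the collection
                }
                dot += tf[i][t] * idf(t) * idf(t) // query term frequency = 1
            }
            if dot > 0 {
                // Length normalization: divide by the document norm.
                hits = append(hits, hit{i, dot / math.Sqrt(norm)})
            }
        }
        // Ranked retrieval: best match first.
        sort.Slice(hits, func(a, b int) bool { return hits[a].score > hits[b].score })
        for _, h := range hits {
            fmt.Printf("doc %d  %.3f  %q\n", h.doc, h.score, docs[h.doc])
        }
    }

Doc 2 matches both query terms and comes back first; the "quick"-only and "dog"-only documents trail it.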
Good catch! But I don't think it matters here: since I didn't implement full text search, there is no question that all documents returned are equally relevant.
I thoroughly enjoyed this post! I also didn't perceive the title as any kind of attempt at "clickbait." Thanks for sharing. I'd be interested in seeing more installments developing this Elasticsearch-style API too :)
How you save/retrieve bytes isn't the interesting part of this blog post, as it's elementary. He could have just defined an interface { save(id, doc), get(id) } and moved on (something like the sketch below).
If anyone seriously wanted to try to follow along with this code, googling how to use the filesystem in Go isn't beyond them. They'd have their own implementation from scratch just as the blog promises.
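FWIW, that interface plus a filesystem-backed version is about a dozen lines of Go. The names here are made up, not the post's actual code:

    package main

    import (
        "fmt"
        "os"
        "path/filepath"
    )

    // Storage is the minimal save/get interface from the parent comment.
    type Storage interface {
        Save(id string, doc []byte) error
        Get(id string) ([]byte, error)
    }

    // fsStorage writes each document to its own file under dir.
    type fsStorage struct{ dir string }

    func (s fsStorage) Save(id string, doc []byte) error {
        return os.WriteFile(filepath.Join(s.dir, id+".json"), doc, 0o644)
    }

    func (s fsStorage) Get(id string) ([]byte, error) {
        return os.ReadFile(filepath.Join(s.dir, id+".json"))
    }

    func main() {
        var s Storage = fsStorage{dir: os.TempDir()}
        if err := s.Save("doc1", []byte(`{"title":"hello"}`)); err != nil {
            panic(err)
        }
        doc, _ := s.Get("doc1")
        fmt.Println(string(doc))
    }

Of course, this toy doesn't sanitize ids against path traversal, fsync anything, or handle concurrent writers, which is roughly where "elementary" stops applying.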
>> How you save/retrieve bytes isn't the interesting part of this blog post as it's elementary
I do not have a beef with the author, at all, in any way, but, LOL, that is anything but elementary.
Furthermore,
>> this post will not implement full text search.
...makes their solution, or more specifically their index, not very "Lucene-like". Which is why I had objections to their title, not their blog post. Their blog post is fine :)
Storage is one of the problems you'll need to solve when doing "Lucene from scratch". Indices are another. Full-text indices implemented from scratch: hard and interesting.
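For a sense of scale: the bare-bones version of such an index, with the AND-only set semantics mentioned at the top of the thread, fits in a few dozen lines of Go (hypothetical names; no real tokenization, ranking, or persistence):

    package main

    import (
        "fmt"
        "sort"
        "strings"
    )

    // index maps each term to the set of document IDs containing it.
    type index map[string]map[int]bool

    func (idx index) add(id int, text string) {
        for _, t := range strings.Fields(strings.ToLower(text)) {
            if idx[t] == nil {
                idx[t] = map[int]bool{}
            }
            idx[t][id] = true
        }
    }

    // and returns the IDs of documents containing every term.
    func (idx index) and(terms ...string) []int {
        var out []int
        for id := range idx[terms[0]] {
            ok := true
            for _, t := range terms[1:] {
                if !idx[t][id] {
                    ok = false
                    break
                }
            }
            if ok {
                out = append(out, id)
            }
        }
        sort.Ints(out)
        return out
    }

    func main() {
        idx := index{}
        idx.add(0, "the quick brown fox")
        idx.add(1, "the lazy dog")
        idx.add(2, "the quick dog")
        fmt.Println(idx.and("quick", "dog")) // [2]
    }

The hard and interesting parts are everything this leaves out: real tokenization, scoring, compressed posting lists, and keeping it all consistent on disk.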
Like most tutorials, surely it's a simple implementation that breaks things down into simple code so that you can see how things work at a more fundamental level and follow along, not something that's going to satisfy ACID properties and compete with Lucene.
Storage engines are far from elementary. It is one of the most complex and subtle parts of a database implementation, especially if you care about reliability, scalability, or performance. The term "from scratch" usually does not denote outsourcing all of the hard parts of a database implementation to an external library.
I think people were just disappointed that the article was more about writing a database-like API than an actual database.
Did you read the article? He addresses your point directly. Your criticism sounds a little petty, but perhaps you're being flippant and it's lost on me.
I most certainly did; eatonphil's articles are always cool. In fact, I wonder if he has a mailing list?
Anyhow, importing the indexer disqualifies the "from scratch" quality.
This doesn't mean there's anything wrong with importing it, only with calling the result "from nothing" when that isn't the case. Such language sets improper expectations and may even reduce reader engagement, since the article's content diverges from what the title claimed to offer.
Took me about an hour for a +40/-40 diff that is still <500 lines of code. You may notice that the code basically looks identical. That's because I wasn't using anything specific to a key-value store.
I've implemented a SQL database from scratch before (that one was only in-memory) so I wanted to try my hand at a document database and this time have it support real storage.
In general, posts on my personal blog are purely for fun and education. :)
In my off time I was writing a document layer atop Cassandra.
If you then add a basic relation capability between documents, all of a sudden a document database becomes a poor man's property-graph database. It was just a bit of a side project without any real production use.
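Roughly like this: documents grow an edges map from relation names to target IDs, and "traversal" is just chasing IDs (all names hypothetical):

    package main

    import "fmt"

    // Doc is a document with properties plus typed references to other docs.
    type Doc struct {
        ID    string
        Props map[string]string
        Edges map[string][]string // relation name -> target doc IDs
    }

    func main() {
        db := map[string]Doc{
            "alice": {
                ID:    "alice",
                Props: map[string]string{"role": "author"},
                Edges: map[string][]string{"wrote": {"post1", "post2"}},
            },
            "post1": {ID: "post1", Props: map[string]string{"title": "Hello"}},
            "post2": {ID: "post2", Props: map[string]string{"title": "World"}},
        }
        // Follow the "wrote" edges out of alice.
        for _, id := range db["alice"].Edges["wrote"] {
            fmt.Println(db[id].Props["title"])
        }
    }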