Writing a document database from scratch in Go: Lucene-like filters and indexes (eatonphil.com)
168 points by eatonphil on March 28, 2022 | hide | past | favorite | 21 comments



This implementation uses set semantics with "and" as the only Boolean operator (Boolean retrieval model) - the dominant paradigm of the 1960s library catalog search systems.
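For illustration, here's a minimal sketch of that AND-only Boolean retrieval in Go, assuming each term's matches are kept as a sorted posting list of document IDs (a simplification, not the post's actual code):

```go
package main

import "fmt"

// intersect merges two sorted posting lists (document IDs), keeping
// only the IDs present in both -- the "and" of Boolean retrieval.
func intersect(a, b []int) []int {
	var out []int
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] == b[j]:
			out = append(out, a[i])
			i++
			j++
		case a[i] < b[j]:
			i++
		default:
			j++
		}
	}
	return out
}

func main() {
	// Documents matching one filter ANDed with documents matching another.
	joe := []int{1, 3, 5, 8}
	thirty := []int{2, 3, 8, 9}
	fmt.Println(intersect(joe, thirty)) // [3 8]
}
```

Note the result is an unranked set: documents 3 and 8 come back in ID order, with no notion of which matched "better."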

The Lucene library and the engines built on top of it (Elastic, Solr) use ranked retrieval, so results are scored and returned in order from most to least relevant. For example, the Vector Space Model (VSM) computes the cosine between the vector that represents the query (vector of all query term frequencies, other components 0) and each document vector (vector of all document term frequencies, other components 0). There are other models too, such as the Probabilistic Model (also known as BM25 or Okapi, after the version of the ranking formula used or the name of the software system implementing it, respectively) and DfR (Divergence from Randomness), another Information Retrieval (IR) model type.
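To make the VSM concrete, a toy cosine scorer in Go, with sparse term-to-weight maps standing in for the vectors (my own sketch, not Lucene's implementation):

```go
package main

import (
	"fmt"
	"math"
)

// cosine scores a document against a query: the cosine of the angle
// between their term vectors, 1.0 when they point the same way.
// Vectors are sparse maps from term to weight; absent terms are 0.
func cosine(query, doc map[string]float64) float64 {
	var dot, qNorm, dNorm float64
	for term, qw := range query {
		dot += qw * doc[term] // missing terms contribute 0
		qNorm += qw * qw
	}
	for _, dw := range doc {
		dNorm += dw * dw
	}
	if qNorm == 0 || dNorm == 0 {
		return 0
	}
	return dot / (math.Sqrt(qNorm) * math.Sqrt(dNorm))
}

func main() {
	query := map[string]float64{"document": 1, "database": 1}
	docA := map[string]float64{"document": 2, "database": 1, "go": 1}
	docB := map[string]float64{"go": 3, "tutorial": 1}
	// docA shares terms with the query and scores higher; docB scores 0.
	fmt.Printf("docA: %.3f docB: %.3f\n", cosine(query, docA), cosine(query, docB))
}
```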

Since 1972, terms have been weighted with the popular TF-IDF weighting scheme (Term Frequency times Inverse Document Frequency, Spärck Jones (1972) J. Doc.), often combined with length normalization. I'd argue that these are important ingredients in a minimal implementation (and the first Lucene version actually had them).
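The basic weighting formula fits in a few lines of Go (using the plain log form here; real engines use smoothed variants):

```go
package main

import (
	"fmt"
	"math"
)

// tfidf weights a term: its frequency within one document times the
// log of how rare it is across the whole collection (the IDF part).
func tfidf(termFreq, docsWithTerm, totalDocs int) float64 {
	if docsWithTerm == 0 {
		return 0
	}
	return float64(termFreq) * math.Log(float64(totalDocs)/float64(docsWithTerm))
}

func main() {
	// In a 1000-document collection, a term appearing 3 times in a
	// document but in only 10 documents overall far outweighs a term
	// appearing 5 times but found in 900 documents.
	fmt.Printf("rare: %.2f common: %.2f\n",
		tfidf(3, 10, 1000),  // 3 * ln(100)
		tfidf(5, 900, 1000)) // 5 * ln(1000/900)
}
```

The rare term scores roughly 13.8 against roughly 0.5 for the common one, which is exactly the "discriminating terms matter more" intuition.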

Without ranked retrieval there is no guarantee that the most relevant result in a list of 10,000 documents is not returned last.


Good catch! But I don't think it matters because I didn't implement full text search so there is no question that all documents returned are equally relevant.


Go folk might get a kick out of this but

>> I said this project is "from scratch" but I'm going to bring in a storage engine

Oh, so... not from scratch, then?


That's kind of a nasty bait and switch, definitely not from scratch if you pick the indexer core up from GitHub on the way home after work.

Phil, I'm a bit disappointed in you. This hasn't been the sort of thing I've observed from your (excellent) prior writings.


You people are harsh! :)

I'll put it this way: I only write about stuff that I think is interesting. And I don't believe my title is clickbait.

If you read the article and still think it's not interesting or not my normal level of educational, well then I'll be surprised.


It is interesting. If the indexer is being imported, it's also not "from scratch", at all. The two are not mutually exclusive.

Your piece will stand on its own, perhaps call it something like:

Implementing The Elasticsearch Backend Myself in Go

Phil, your editorial honesty, integrity, and accuracy will be appreciated by me and others. <3


I thoroughly enjoyed this post! I also didn't perceive the title as any kind of attempt at "clickbait." Thanks for sharing. I would be interested in seeing more installments of developing this Elasticsearch style API too :)


I don't get the emotions.

How you save/retrieve bytes isn't the interesting part of this blog post as it's elementary. He could have just defined an interface { save(id, doc), get(id) } and moved on.

If anyone seriously wanted to try to follow along with this code, googling how to use the filesystem in Go isn't beyond them. They'd have their own implementation from scratch just as the blog promises.
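To spell out the parent's point, a sketch of that interface plus a naive one-file-per-document implementation backed by the filesystem (names like `storage` and `fileStorage` are mine, not the post's actual code):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// storage is the minimal interface described above: the rest of the
// database only needs to save and fetch a document's bytes by id.
type storage interface {
	save(id string, doc []byte) error
	get(id string) ([]byte, error)
}

// fileStorage keeps one file per document under a directory --
// a from-scratch stand-in for an embedded key-value store.
type fileStorage struct{ dir string }

func (f fileStorage) save(id string, doc []byte) error {
	return os.WriteFile(filepath.Join(f.dir, id+".json"), doc, 0o644)
}

func (f fileStorage) get(id string) ([]byte, error) {
	return os.ReadFile(filepath.Join(f.dir, id+".json"))
}

func main() {
	dir, _ := os.MkdirTemp("", "docdb")
	defer os.RemoveAll(dir)

	var s storage = fileStorage{dir: dir}
	s.save("1", []byte(`{"name":"joe","age":30}`))
	doc, _ := s.get("1")
	fmt.Println(string(doc))
}
```

Of course this ignores durability, concurrency, and crash recovery entirely, which is the point: the storage engine is swappable behind two methods.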


>> How you save/retrieve bytes isn't the interesting part of this blog post as it's elementary

I do not have a beef with the author, at all, in any way, but, LOL, that is anything but elementary.

Furthermore,

>> this post will not implement full text search.

...makes their solution or specifically, their index, not very "Lucene-like". Which is why I had objections to their title, not their blog post. Their blog post is fine :)

Storage is one of the problems you'll need to solve, when doing "Lucene from scratch". Indices are another. Full-text indices implemented from scratch: hard and interesting.
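For a sense of where a full-text index starts, the core data structure is just an inverted index mapping tokens to document IDs; a toy version in Go (no stemming, stop words, or positions):

```go
package main

import (
	"fmt"
	"strings"
)

// index maps each token to the IDs of the documents containing it:
// the inverted index at the heart of any full-text engine.
type index map[string][]int

func (ix index) add(id int, text string) {
	seen := map[string]bool{}
	for _, tok := range strings.Fields(strings.ToLower(text)) {
		if !seen[tok] { // record each document once per token
			seen[tok] = true
			ix[tok] = append(ix[tok], id)
		}
	}
}

func main() {
	ix := index{}
	ix.add(1, "writing a document database")
	ix.add(2, "a database from scratch")
	fmt.Println(ix["database"]) // [1 2]
	fmt.Println(ix["scratch"])  // [2]
}
```

The hard and interesting parts come after this: tokenization, persistence, merging segments, and scoring.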


> LOL, that is anything but elementary.

Like most tutorials, surely it's a simple implementation that breaks things down into simple code so that you can see how things work at a more fundamental level and follow along, not something that's going to satisfy ACID properties and compete with Lucene.

Way too much guff in these comments.


It's a very good run-through but it's not at all from scratch. That's my sole point. Didn't mean to be harsh, but their title is click-bait.


Storage engines are far from elementary. It is one of the most complex and subtle parts of a database implementation, especially if you care about reliability, scalability, or performance. The term "from scratch" usually does not denote outsourcing all of the hard parts of a database implementation to an external library.

I think people were just disappointed that the article was more about writing a database-like API than an actual database.


I’ll be anxiously awaiting your “from scratch” tutorials :)

Edit: not facetious. I very much enjoy reading your blog posts and HN comments.


Did you read the article? He addresses your point directly. Your criticism sounds a little petty, but perhaps you're being flippant and it's lost on me.


I most certainly did, eatonphil's articles are always cool, in fact I wonder if he has a mailing list?

Anyhow, importing the indexer disqualifies the "from scratch" quality.

This doesn't mean there's anything wrong with importing it, only calling it "from nothing" when this isn't the case. Such language sets improper expectations in the reader and may even reduce reader engagement, since article content diverges from what the title claimed to offer.


Hey since my paragraph saying it's easy to swap out storage engines wasn't believable, here's a diff without any 3rd party library:

https://github.com/eatonphil/docdb/pull/1

Took me about an hour for a +40/-40 diff that is still <500 lines of code. You may notice that the code basically looks identical. That's because I wasn't using anything specific to a key-value store.


I wonder what the motivation is.

Did you identify a limit to ES? Did you need a simple query builder for advanced CockroachDB search queries?

Whatever your initial needs (production or experimentation), please add a short explanation in the article! ;)


I've implemented a SQL database from scratch before (that one was only in-memory) so I wanted to try my hand at a document database and this time have it support real storage.

In general, posts on my personal blog are purely for fun and education. :)


What are you using to scale the document storage across nodes? Is it the usual master-slave stuff?


Just an educational proof-of-concept; scaling isn't part of that concept.


In my off time I was writing a document layer atop Cassandra.

If you add a basic relation capability between documents, all of a sudden a document database becomes a poor man's property-graph database. It was also just a side project without any real production use.



