
Elasticsearch from the Bottom Up (2013) - bobjordan
https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up
======
jjwiseman
Back in 2006 I wanted to learn more about how search engines worked, so I
started porting Lucene to Common Lisp. Actually, I wrote a Common Lisp port of
Ferret. Ferret is a Ruby port of Lucene. Lucene is sort of Doug Cutting's Java
version of Text Database (TDB), which he and Jan Pedersen developed at Xerox
PARC, and which, to complete the circle, was written in Common Lisp.

I didn't know Ruby, and I didn't know search engines, but I did know Common
Lisp. It took me 7 months to create a binary-compatible, pretty dang
functional port of Lucene. I called it Montezuma[1], and it's still actually
used by people.

About half the time was spent implementing the text analyzer, document store,
and indices. The remaining 50% was spent implementing search (and parsing the
query language). It was a very rewarding experience: it was one of the largest
projects I'd worked on, mostly solo (I got some help near the end), in an area
I knew nothing about, working in an extremely test-driven fashion (over 2000
unit tests when I was done).

I did not, however, learn as much about indexing and search as I expected. I
learned a lot, yes, but quite a bit of the Ruby code was so easy to translate
to Common Lisp that I didn't have to understand everything in order to make it
work.

I still recommend writing a small search engine as an interesting exercise,
though.

1\.
[https://github.com/sharplispers/montezuma](https://github.com/sharplispers/montezuma)

~~~
ignoramous
A footnote: For Java devs out there, the Lucene codebase is a joy to read [0],
especially the APIs. Highly recommend it just for the documentation, which is
top-notch. Michael McCandless, Lucene's committer-in-chief, sometimes blogs about
its internals [1].

[0] [https://github.com/apache/lucene-solr/tree/master/lucene](https://github.com/apache/lucene-solr/tree/master/lucene)

[1] [http://blog.mikemccandless.com/2019/10/concurrent-query-exec...](http://blog.mikemccandless.com/2019/10/concurrent-query-execution-in-apache.html)

------
brasetvik
Hey. I wrote that a long time ago. Funny to see it re-surface and glad it's
helpful. :)

There is a presentation version of it here:
[https://www.youtube.com/watch?v=PpX7J-G2PEo&feature=youtu.be](https://www.youtube.com/watch?v=PpX7J-G2PEo&feature=youtu.be)
(I made the presentation first and then wrote the blog posts)

I wrote a follow-up called Elasticsearch from the Top Down here:
[https://www.elastic.co/blog/found-elasticsearch-top-down](https://www.elastic.co/blog/found-elasticsearch-top-down)

~~~
bobjordan
Nicely written posts that helped me better understand what's going on, thanks
for writing.

------
bratao
Shameless plug from someone who wants this project to flourish. Check
[https://vespa.ai](https://vespa.ai) as an alternative to Elasticsearch.
Migrating from ES to it, I got faster search, never had to face an
unhealthy node, and gained native tensor support (and native ANN is coming soon:
[https://github.com/vespa-engine/vespa/issues/9747](https://github.com/vespa-engine/vespa/issues/9747)).

Very mature, and still progressing at a breakneck rate
([https://blog.vespa.ai](https://blog.vespa.ai))

~~~
atombender
Vespa looks pretty good, at least in terms of performance and operation. I've
been evaluating it myself. I'm less happy about everything else.

It's got a mishmash of odd APIs, lots of XML, several query languages, lots of
weird little quirks. It doesn't feel modern. It's pretty clear that this is
originally an in-house project, developed over many years by many people,
where not as much effort has been spent on consistent/cohesive design or
documentation.

One rough area is the approach to schemas and indexing. Rather than let you
define a "clean" schema and put in _your_ data and then have Vespa index it in
all the ways it knows about, you're forced to essentially reshape your data
into a format compatible with Vespa, which brings with it some severe
restrictions. For example, Vespa will not index arbitrarily nested structured
data. If you have something like {categories: [{id: 1}]}, Vespa will not index
that. You have to flatten any array data to the top level. Nested maps and
arrays are mostly not supported, although it's hard to tell from the
documentation what is supported.
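
To make the flattening concrete, here is a rough Python sketch of the kind of reshaping I mean. It is not Vespa code and uses no Vespa API; the `flatten` helper and the `parent_child` field naming are made up for illustration. It collapses nested maps and arrays of maps into top-level multi-valued fields:

```python
def flatten(doc, prefix=""):
    """Collapse nested maps and arrays of maps into top-level fields.

    Every resulting field is a list, so repeated nested values simply
    accumulate under one flattened name (hypothetical naming scheme).
    """
    flat = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested map: recurse and merge under "parent_child" names.
            for k, v in flatten(value, name + "_").items():
                flat.setdefault(k, []).extend(v)
        elif isinstance(value, list):
            for item in value:
                if isinstance(item, dict):
                    # Array of maps: each map contributes to the same fields.
                    for k, v in flatten(item, name + "_").items():
                        flat.setdefault(k, []).extend(v)
                else:
                    flat.setdefault(name, []).append(item)
        else:
            flat.setdefault(name, []).append(value)
    return flat


doc = {"title": "x", "categories": [{"id": 1}, {"id": 2}]}
print(flatten(doc))  # {'title': ['x'], 'categories_id': [1, 2]}
```

Note what gets lost: once there are two fields per category, nothing ties a given id back to its sibling values, which is exactly the restriction I'm complaining about.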

Vespa is also very obviously skewed toward ranking, not filtering. You can't
search by exact string matching: You can't do something like "topic = 'news'".
You only get case-insensitive substring search. It's got a lot of ranking
functions but very little that's optimized for filtering.

Overall, I'm a bit surprised that Vespa's authors position it as an
Elasticsearch competitor, because you certainly cannot just port an app that
uses ES over to it.

To be sure, it's got lots of interesting features such as ML integration, and,
again, performance and clustering design seems good. But it still feels very
much like a niche product.

~~~
bratao
I migrated from ES, and I don't agree that it doesn't feel modern. The
middleware logic container and live reconfiguration are mind-blowing. About
those two things:

\- Nested (For my use cases, this is a problem I do not have. For more complex
cases, there is parent-child: [https://blog.vespa.ai/post/174589826190/parent-child-in-vesp...](https://blog.vespa.ai/post/174589826190/parent-child-in-vespa))

\- Exact match (use the exact match: [https://docs.vespa.ai/documentation/reference/search-definit...](https://docs.vespa.ai/documentation/reference/search-definitions-reference.html#match))

~~~
atombender
By modern I mean the approach to configuring and running, and the myriad of
languages used: Antiquated XML for some things, a homegrown DSL for others,
JSON for query results, then multiple languages for expressing various parts
of the query -- it's pretty chaotic.

Another thing that felt antiquated: The whole notion of uploading an
"application". I can appreciate the benefits of controlling the lifecycle of
the configuration and having Vespa distribute it to nodes. But when you start
out, that "application" is just one or two files, and yet you have to create a
whole directory structure for it, as opposed to just POSTing individual
configs to REST endpoints like you can do with ES. The heavy-handedness of it
feels very "Java".

The document you linked to is a different type of exact match. I've been
through this, and even posted a GitHub issue. Mysteriously, a Vespa developer
replied that nobody had ever needed exact string matching, so nobody had
bothered to implement it.

Parent/child is not applicable to what I was talking about, I think. I'm not
talking about hierarchical relationships.

For my part, most of my work is in structured data, not text or vector-based
ranking, and Vespa really doesn't seem to be designed for that.

ES also has a very, very good aggregation API. Vespa's aggregation syntax is
odd and seemingly much more limited.

------
misterman0
I used to use Lucene back in the 1.x days, when a fuzzy search was a complete
table scan. It was quite a surprise to see that a single-term fuzzy query
was rewritten into one term query per fuzzy hit, all OR-ed together. The
Lucene team soon realized they needed to code a Levenshtein automaton, but none
of them had ever done that before. They pulled several all-nighters reading
math papers and coding, and when they succeeded they were so happy they told
the world about it [0]. It's a great story.

[0] [https://dzone.com/articles/lucenes-fuzzyquery-100-times](https://dzone.com/articles/lucenes-fuzzyquery-100-times)
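
For anyone who hasn't seen the old behaviour, here is a rough Python sketch of the brute-force approach as I remember it. This is not Lucene's actual code: the function names and the `postings` dict (term to set of document ids) are made up for illustration. The point is that every term in the dictionary gets compared against the query, which is why it amounted to a table scan:

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def expand_fuzzy(term, term_dictionary, max_edits=2):
    """Scan every indexed term and keep those within max_edits of the query."""
    return [t for t in term_dictionary if edit_distance(term, t) <= max_edits]


def fuzzy_search(term, postings, max_edits=2):
    """OR together the posting lists of every term the expansion produced."""
    hits = set()
    for t in expand_fuzzy(term, list(postings), max_edits):
        hits |= postings[t]
    return hits


postings = {"lucene": {1, 2}, "lucent": {3}, "lucerne": {5}, "ferret": {4}}
print(sorted(fuzzy_search("lucene", postings, max_edits=1)))  # [1, 2, 3, 5]
```

A Levenshtein automaton avoids the scan by walking the sorted term dictionary and skipping whole ranges that cannot match, instead of computing a distance per term.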

------
pixelmonkey
There's also a YouTube recording of a talk with similar content by the same
author from EuroPython 2014. Helped me out when I was adopting ES at scale in
that time period. (And the principles are pretty timeless to modern ES, too.)

[https://youtu.be/PpX7J-G2PEo](https://youtu.be/PpX7J-G2PEo)

If you like this, you might also enjoy my deep dive on Lucene (the indexing
technology underneath Elasticsearch) in "Lucene: The Good Parts":

[https://blog.parse.ly/post/1691/lucene/?utm_source=hn](https://blog.parse.ly/post/1691/lucene/?utm_source=hn)

------
MuffinFlavored
Does Elasticsearch need to be as complicated as it is?

I was surprised to find there wasn't an Elasticsearch + Kibana competitor that
is "simpler".

I just want to be able to store JSON logs with a timestamp + a bunch of fields
then search them in a nice little UI later. Apparently, that's pretty hard to
do right.

~~~
mountaineer
> JSON logs with a timestamp + a bunch of fields then search them

There is S3+Athena for this with AWS and Google can store/query JSON with
BigQuery. The nice little UI doesn't come with it, but at least you don't have
to spin up an Elastic cluster.

~~~
PretzelFisch
Do you have an S3 + Athena example? When I tried it, it didn't seem to query
ad hoc JSON files; rather, the S3 file needs to contain an array of JSON
documents.
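
In case the format mismatch is the other way around, here is a small Python sketch of converting between the two shapes. It assumes the query engine wants newline-delimited JSON (one object per line) rather than a single top-level array; I haven't verified that against Athena here, and the `array_to_ndjson` name is made up:

```python
import json


def array_to_ndjson(text):
    """Turn a JSON array of objects into newline-delimited JSON."""
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # One compact JSON document per line, each terminated by a newline.
    return "".join(json.dumps(r, separators=(",", ":")) + "\n" for r in records)


raw = '[{"ts": 1, "msg": "a"}, {"ts": 2, "msg": "b"}]'
print(array_to_ndjson(raw), end="")
# {"ts":1,"msg":"a"}
# {"ts":2,"msg":"b"}
```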

------
inertiatic
One thing that really bothers me about ES is that compared to Solr, some terms
that have a specific meaning in Lucene are either not used for the
corresponding concept or even worse re-used for a different one.

It sometimes makes explaining the underlying implementation a bit harder to
people who are Lucene-agnostic but are ES users, with no good reason apart
from, I would guess, brand differentiation?

------
drej
I can super highly recommend talks by Adrien Grand on Lucene and
Elasticsearch. They are super instructive yet easy to follow.

[https://www.youtube.com/watch?v=T5RmMNDR5XI](https://www.youtube.com/watch?v=T5RmMNDR5XI)

[https://www.youtube.com/watch?v=eQ-rXP-D80U](https://www.youtube.com/watch?v=eQ-rXP-D80U)

