Ask HN: Why should I use Elasticsearch instead of building from scratch - xkbd
======
ctvo
When did this place become such a low-effort brain dump? I can't believe I'm
so annoyed by this question. Google it. Make some educated trade-off
decisions based on your context.

Use HN to poll for opinions and experiences from others, not for things that
take 30 minutes to resolve.

~~~
elorm
As much as OP did lazy work by posting this question, I don’t think your
response was fair or helpful in any way. In the end you didn’t add anything
to the conversation either.

~~~
ctvo
What can you add to this discussion? Zero context is provided. It's not even
a properly formed question. As someone who's familiar with Lucene, Solr, and
Elasticsearch, even if I wanted to help, I couldn't.

Sometimes posts are shit, and it's OK to call them out to hopefully improve
this site collectively.

------
tedmiston
You should write one from scratch to get a deeper understanding of how hard
it is to return highly relevant results quickly. Tokenizing, stemming, bag
of words, and tf-idf ranking get you to an MVP, but then you realize how
good production-grade search engines are today.
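
Roughly, a toy version of that MVP (naive tokenizer, no stemming, made-up
class name) fits in a page of Python:

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    # Naive tokenizer: lowercase and split on non-word characters.
    return re.findall(r"\w+", text.lower())

class TinySearch:
    def __init__(self, docs):
        self.docs = docs
        self.index = defaultdict(set)  # term -> doc ids (inverted index)
        self.tf = []                   # per-document term counts
        for i, doc in enumerate(docs):
            counts = Counter(tokenize(doc))
            self.tf.append(counts)
            for term in counts:
                self.index[term].add(i)

    def idf(self, term):
        # Smoothed inverse document frequency: rare terms score higher.
        df = len(self.index.get(term, ()))
        return math.log((1 + len(self.docs)) / (1 + df)) + 1

    def search(self, query, k=10):
        # Score = sum of tf * idf over matching query terms.
        scores = Counter()
        for term in tokenize(query):
            for i in self.index.get(term, ()):
                scores[i] += self.tf[i][term] * self.idf(term)
        return scores.most_common(k)

engine = TinySearch(["the quick brown fox", "lazy dogs sleep",
                     "quick quick fox"])
print(engine.search("quick fox"))  # doc 2 outranks doc 0
```

It works, and then you run it against real queries and a real corpus and
discover everything it gets wrong.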

Solr is good. I've been wanting to try Lunr [1] for small sites.

[1]:
[https://github.com/olivernn/lunr.js](https://github.com/olivernn/lunr.js)

~~~
mlthoughts2018
I previously worked at a company where Solr was used to scale the business,
and it stopped being performant for us after a while.

We wrote our own search engine at that point. You are right that there are a
lot of little “devil in the details” issues. But overall it was a fun
experience.

This was needed to support some specific machine learning workflows in the
search ranking process, which couldn’t be used if we first paid the high
latency cost of getting preliminary results from Solr.

So we took a “create your own index data structures” approach with index data
(both the normalized bag of words vectors and companion data like boolean
filters), which allowed us to highly optimize the initial broad ranking query.
Latency was low enough that it allowed the time cost of calling follow-on
machine learning services.

This was for a fairly high-traffic product search engine at an online
retailer. It ended up working very well: over a span of about two years we
rolled all search traffic onto the in-house platform, even the parts not
needing the machine learning services, query latencies went down across all
our traffic, and we retired the original Solr implementation.

It wouldn’t be the right choice for everyone, but it strongly informs my
opinion on whether building an in-house search engine specifically to
replace Solr is worthwhile. I suspect a lot of medium-sized or large
companies running Solr should seriously consider it.

~~~
sova
Could you tell us more about the kinds of indices you used, and what you mean
by boolean filters? Thank you!

~~~
mlthoughts2018
Really it’s not fancy or anything. We used Eigen to represent our normalized
bag of words matrix (term-document matrix) as a sparse matrix in CSC and CSR
format (which means the data resides in three underlying arrays for the
nonzero entries, with indexing conventions for how to use them).

Boolean & multi-choice indices are just companion arrays where position i
corresponds to a property of document i in the index: a boolean for binary
attributes (for example, whether the item has free shipping or not), or a
bigger integer space to encode more options (say, an int8 coupled with
helper functions that check which bit is set, for some set of 8 categories
the items can be filtered by).

The “index” is just the serialized arrays backing the sparse matrix, the
arrays backing the filters, and helper functions for decoding what the filter
bits mean.

A query is then just a matter of applying the filters, performing the sparse
matrix inner product, and sorting.
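
In Python/scipy terms (not our actual Eigen code; the data and names here
are made up purely for illustration), the query shape was roughly:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Normalized term-document matrix (documents x vocabulary) in CSR format,
# plus companion per-document filter arrays: position i describes document i.
docs = csr_matrix(np.array([
    [0.0, 0.7, 0.7, 0.0],
    [0.6, 0.0, 0.0, 0.8],
    [0.7, 0.7, 0.0, 0.0],
]))
free_shipping = np.array([True, False, True])  # boolean attribute
category_bits = np.array([0b001, 0b010, 0b001], dtype=np.uint8)  # <= 8 categories

def in_category(bits, cat_bit):
    # Helper that checks which bit is set for a multi-choice filter.
    return (bits & cat_bit) != 0

def query(qvec, cat_bit, k=10):
    # 1. Apply the cheap boolean/bit filters first.
    mask = free_shipping & in_category(category_bits, cat_bit)
    # 2. Sparse inner product gives a broad relevance score per document.
    scores = docs @ qvec
    scores[~mask] = -np.inf  # discard filtered-out documents
    # 3. Sort and keep the top k survivors.
    order = np.argsort(scores)[::-1]
    return [(i, scores[i]) for i in order[:k] if scores[i] != -np.inf]

print(query(np.array([0.0, 1.0, 0.0, 0.0]), cat_bit=0b001))
```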

It’s very basic, but that lets you optimize heavily, whether for deletes,
writes, certain heavily used filters, etc.

And you can of course add whatever fancy NLP stuff on top of or in place of
the sparse matrix as well.

~~~
sova
It sounds like a great approach to enable both speed in searching and
customizability in indexing -- no way you get to bit level box ticking with
conventional means. Thanks for explaining. Bag of Words, last I recall
reading, is a statistical method for predicting the next word, so I'm curious
how that played out for you.

~~~
mlthoughts2018
Bag of words is an encoding method for converting some sequence of words into
a long, sparse vector of word counts. The vector has one component for each
possible word of the vocabulary, and if the count of word i is N, then you
store N in component i of the bag of words vector.

If you think of these as sparse row vectors (the columns correspond to all
vocabulary entries), then you store them as a matrix where you stack on
another row for each “document” in your data set.

Later on when you get a new “document” at query time, you transform it into
the same bag of words vector format, and then an inner product between the
matrix and the query vector corresponds to a type of relevance / similarity
useful for sorting into a ranked order of results.
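
In toy Python form (dense lists for clarity; real implementations keep
these sparse):

```python
from collections import Counter

vocab = {"fox": 0, "quick": 1, "lazy": 2, "dog": 3}  # word -> component

def bag_of_words(words, vocab):
    # Component i holds the count N of vocabulary word i.
    vec = [0] * len(vocab)
    for word, n in Counter(words).items():
        if word in vocab:
            vec[vocab[word]] = n
    return vec

# Stack one row per document to get the term-document matrix.
matrix = [bag_of_words(doc.split(), vocab)
          for doc in ["quick quick fox", "lazy dog", "quick fox dog"]]

# At query time, encode the query the same way and take inner products.
q = bag_of_words("quick fox".split(), vocab)
scores = [sum(a * b for a, b in zip(row, q)) for row in matrix]
print(scores)  # [3, 0, 2] -> first document ranks highest
```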

In practical situations you have to work harder, because you need more units
of text than just words (such as n-grams), and the raw term counts usually
need to be weighted (e.g. matching a 3-gram probably means more than matching
a single word) or normalized (e.g. longer documents happen to have more words,
but that doesn’t mean they are more similar), and you need to account for
results that are historically more popular or results that are newer.

It’s a very simple approach to document search, but it works well and there
are extensions that utilize word embeddings or models that predict rankings of
results.

Once you get a system running with the term-document matrix, it is a nice
platform for more advanced experimentation and machine learning feature
development.

------
jakelazaroff
Is this for work, and is search a competitive advantage for you? If not,
drop in Elasticsearch and spend your time on your differentiators.

------
dewey
With that little information from the OP, the answer is probably: use ES,
or, if it's a small side project, use your DB's included full text search if
it's good enough.

~~~
danielecook
When is full text search within a DB not good enough? Is ES usually used
alongside a typical RDBMS, or is it a replacement?

~~~
dewey
I don't have a lot of experience with PG full text search in production or
at a bigger scale. I'd just suspect that it doesn't perform that well if you
need a lot of filters, range queries, etc. Maybe someone with more
experience can chime in.
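
For reference, the basic shape in PG is just SQL, roughly like this (the
table and column names are made up, but to_tsvector, plainto_tsquery,
ts_rank, and @@ are PG's built-in full text primitives):

```python
import psycopg2

conn = psycopg2.connect("dbname=shop")  # hypothetical database
cur = conn.cursor()

# Full text match combined with an ordinary range filter in one query.
cur.execute("""
    SELECT id, title,
           ts_rank(to_tsvector('english', title || ' ' || description),
                   plainto_tsquery('english', %s)) AS rank
    FROM products
    WHERE to_tsvector('english', title || ' ' || description)
          @@ plainto_tsquery('english', %s)
      AND price BETWEEN %s AND %s
    ORDER BY rank DESC
    LIMIT 20
""", ("wireless headphones", "wireless headphones", 20, 200))
print(cur.fetchall())
```

In practice you'd precompute the tsvector into its own column backed by a
GIN index rather than building it per query.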

At work we just materialize the data from PG into ES and take advantage of the
powerful ES queries and redundancy. Scaling up by just adding nodes is easier.

~~~
dominotw
How do you verify that all the data made it from PG into ES, and whether
anything got lost in transit?

~~~
dewey
For our use case we don't really need strict verification. Everything that
should go to ES is put in a queue (using DB triggers) and then sent off to
ES. If something caused errors, I assume we'd see it in the error rate on ES
ingest.
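
A sketch of what such a worker could look like (the queue table, index name,
and connection details are all made up; Python purely for illustration):

```python
import psycopg2
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

# Hypothetical queue table, populated by DB triggers: (id, doc_id, payload).
conn = psycopg2.connect("dbname=app")
es = Elasticsearch("http://localhost:9200")

def drain_queue(batch_size=500):
    with conn, conn.cursor() as cur:
        # Claim a batch; SKIP LOCKED lets several workers run concurrently.
        cur.execute("""
            DELETE FROM es_queue
            WHERE id IN (SELECT id FROM es_queue
                         ORDER BY id LIMIT %s FOR UPDATE SKIP LOCKED)
            RETURNING doc_id, payload
        """, (batch_size,))
        rows = cur.fetchall()
        if rows:
            # Bulk-index the claimed batch into ES in one request. If this
            # raises, the surrounding transaction rolls back and the rows
            # stay in the queue for a retry.
            bulk(es, ({"_index": "products", "_id": doc_id, "_source": payload}
                      for doc_id, payload in rows))
        return len(rows)

while drain_queue():
    pass  # keep draining until the queue is empty
```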

------
33degrees
Elasticsearch is incredibly deep, and highly performant. If all you need is
simple full text search, then rolling your own can be an interesting
exercise, but I can't imagine the number of hours it would take to replicate
the features I use on a daily basis.

~~~
dozzie
> Elasticsearch is incredibly deep, and highly performant

For some value of "highly performant". I remember its search (exact substring
match) being significantly slower than simply running grep on the same data
(JSON documents produced from syslog logs) stored in flat files.

It did have several advantages over grep in that scenario (e.g. having a
structured query language and being accessible to other programs over the
network), but performance was not one of them.

~~~
33degrees
Right: my experience is with much more complex scenarios, and in comparison
to an RDBMS. Things that would take multiple queries, like aggregations, can
be done in a single, fast call. It does require a proper indexing setup and
some fine tuning, though.
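
For example, one round trip can return the top hits plus a per-brand
breakdown with average price, something that would take several GROUP BY
queries against an RDBMS (the index and field names are made up; this uses
the 8.x Python client):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# One request: full text hits plus a nested aggregation.
resp = es.search(
    index="products",
    query={"match": {"title": "headphones"}},
    aggs={
        "by_brand": {
            "terms": {"field": "brand"},
            "aggs": {"avg_price": {"avg": {"field": "price"}}},
        }
    },
    size=10,
)
for bucket in resp["aggregations"]["by_brand"]["buckets"]:
    print(bucket["key"], bucket["doc_count"], bucket["avg_price"]["value"])
```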

------
mikece
Have you written a search engine from scratch before? There's a reason this is
still a primary field for PhD work.

------
smilesnd
The simple answer is that Elasticsearch has had thousands of hours already
put into its code base. The real question should be: why shouldn't you use
Elasticsearch? Is the code base too large to fit where you need it to be?
Will it be able to scale with your project? Is it efficient enough for your
requirements? When choosing a piece of technology, the requirements and
long-term effects are what matter. Roll your own if that is what's required
for you to reach your end goal.

------
mlevental
What a weird question: who looks at a search engine and thinks, yeah,
that's trivial enough, I could do it myself in a weekend?

~~~
rjkennedy98
Seriously, creating an efficient, scalable search engine is among the most
difficult computer science problems. From stemming to combined queries, word
tokenization, handling various string collations and language issues,
caching, parallelizing work, and handling huge numbers of writes, there are
so many tricky parts. I used to work for a search startup and I can answer
the OP's question: do not try to write your own search engine. That work
should be done by someone with a PhD and decades of experience. Even
Elastic, which is great software, has issues, such as not being
transactional and struggling with huge numbers of writes.

~~~
sova
It depends on the complexity, which language(s) you are using, and how you
will parse the search string (which context-free grammar it will obey). Just
recently this was a task for me that ended up landing me an interview: a
take-home problem. Even a simple search engine needs some clever reverse
indexing for speed. Add any sort of logic like AND or OR (which not even
Google fully implements) and now your parser has to work, and you have to be
able to translate from a tokenized parse tree with operators to results.
It's a great learning exercise for someone with experience, initiative, and
enough background with CFG parsing, building a reverse index, and set logic,
but without some key computer science building blocks it would end up being
quite a challenge.
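
The reverse index and set logic halves are small; a toy Python sketch (the
CFG parser that would produce the operator tree is the part omitted here):

```python
from collections import defaultdict

docs = {1: "quick brown fox", 2: "lazy brown dog", 3: "quick red dog"}

# Reverse (inverted) index: term -> set of document ids.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# Set logic: AND is intersection, OR is union over the posting sets.
def search_and(*terms):
    sets = [index[t] for t in terms]
    return set.intersection(*sets) if sets else set()

def search_or(*terms):
    return set.union(*(index[t] for t in terms)) if terms else set()

print(search_and("quick", "dog"))  # {3}
print(search_or("fox", "lazy"))    # {1, 2}
```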

------
anonfunction
To leverage the thousands of hours that went into it.

------
based2
What do you want?

[https://db-engines.com/en/ranking](https://db-engines.com/en/ranking)

------
wallflower
Actually, the origin story of Elasticsearch started with Shay Banon
attempting to build a cooking app for his wife, who was training to become a
chef.

> JAXenter: You started Compass, your first Lucene­-based technology, in 2004.
> Do you remember how and why you became interested in Lucene in the first
> place?

> Shay Banon: Reminiscing on Compass birth always puts a smile on my face.
> Compass, and my involvement with Lucene, started by chance. At the time, I
> was a newlywed that just moved to London to support my wife with her dream
> of becoming a chef. I was unemployed, and desperately in need of a job, so I
> decided to play around with “new age” technologies in order to get my skills
> more up­to­date. Playing around with new technologies only works when you
> are actually trying to build something, so I decided to build an app that my
> wife could use to capture all the cooking knowledge she was gathering during
> her chef lessons.

> I picked many different technologies for this cooking app, but at the core
> of it, in my mind, was a single search box where the cooking knowledge
> experience would start: a single box where typing a concept, a thought, or
> an ingredient would start the path towards exploring what was possible.

> This quickly led me to Lucene, which was the defacto search library
> available for Java at the time. I got immersed in it, and Compass was born
> out of the effort of trying to simplify using Lucene in your typical Java
> applications (conceptually, it simply started as a “Hibernate” (Java ORM
> library) for Lucene).

> I got completely hooked with the project, and was working on it more than
> the cooking app itself, up to a point where it was taking most of my time. I
> decided to open source it a few months afterwards, and it immediately took
> off. Compass basically allowed users to easily map their domain model (the
> code that maps app/business concepts in a typical program) to Lucene, easily
> index them, and then easily search them.

> That freedom caused people to start to use Compass, and Lucene, in
> situations that were wonderfully unexpected. Imagine already having the
> model of a Trade in your financial app, one could easily index that Trade
> using Compass into Lucene, and then search for it. The freedom of searching
> across any aspect of a Trade allowed users to convey this freedom to their
> users, which proved to be an extremely powerful concept.

> Effectively, this allowed me to be in the front seat of talking and working
> with actual users that were discovering, as was I, the amazing power that
> search can have when it comes to delivering business value to their users.
> Oh, and btw, my wife is still waiting for that cooking app. Now, 10 years
> later, it is the basis of Elasticsearch.

[https://jaxenter.com/elasticsearch-founder-interview-112677.html](https://jaxenter.com/elasticsearch-founder-interview-112677.html)

------
ian1321
Tough to answer w/o more info. FWIW, I've used Lucene, Solr, and
Elasticsearch, and I've ended up settling on Lucene as the best interface
for me.

~~~
dajohnson89
I thought Lucene was the underlying query language, whereas Solr & ES both
just utilized it...

~~~
rjkennedy98
Yeah, I'm not sure what the OP is talking about. Lucene is the Java search
library that Elastic uses. Elastic is a full clustered search engine with
HA, sharding, and a REST API. They aren't exactly interchangeable.

------
xellisx
Sphinx is another full text search engine.

------
hiroshi3110
How about implementing a search engine on top of a key-value store like
FoundationDB?

------
courtneycouch0
Definitely build it completely from scratch. You should roll your own TCP
libraries as well. Don't trust anything you didn't write yourself. Come to
think of it, I'm not sure you should rely on someone else's hardware either.

