
Ask HN: Are Lucene/Solr/ES Still Used for Search? - lovelearning
I casually visit jobs/freelancing sites once in a while. I don't see as much demand for Lucene/Solr/ES skills for website / text / document search or other kinds of information retrieval as I used to about 4-5 years ago.

ES seems to be the most popular, but only in its ELK avatar for devops dashboards.

What technologies are you people using for text, document, or website search nowadays?
======
crawdog
Very much so. For retail/catalog search, SOLR dominates. There's a lot more
customization available out of the box for relevancy/ranking than in Elastic.
The drawback is managing indexing: SolrCloud is much harder to manage.

For commodity search workloads (general retrieval/faceting) Elastic does a
fine job. It scales well and there is good documentation and support.

Lucene is the core engine behind both of these solutions.

For fun, let's look at the large Enterprise acquisitions over the years:

* Verity - bought by Autonomy

* Fast - bought by Microsoft (Also known as the Enron of Norway...)

* Autonomy - bought by HP (Look at the backstory on this deal!)

* Endeca - bought by Oracle

* Vivisimo - bought by IBM

* Google - GSA (now Google Cloud Search, hosted solution)

Next, follow the path of online acquisitions:

* IndexTank - bought by LinkedIn

* Swiftype - bought by Elastic

There are a number of interesting independent players still. Coveo plays in
the Enterprise space, but it's a hard market. Algolia is doing great in the
commodity online search space and seems to be growing well.

This is an area I think is open to more competition. Especially with AI/ML
technologies available around Document Understanding - the Enterprise market
is open for a good on-prem upstart to really take off.

Ping me offline if you have additional questions - spent almost 20 years in
the space and ran a search company of my own.

~~~
aeyes
* Fast - bought by Microsoft (Also known as the Enron of Norway...)

That one was painful to live through, we got forced to migrate to Windows and
everything went sideways. That was almost 10 years ago with quite a big
cluster (tens of nodes).

~~~
tallanvor
You were never forced to migrate to Windows. In fact, the last major customers
on ESP were using Linux to the end.

------
jillesvangurp
Elasticsearch, definitely. I always recommend using it in hosted form and not
running your own cluster. That allows you to focus on getting data in and out
of your cluster instead of sinking time into doing devops.

While I have not used Solr recently, it has evolved alongside ES into a solid
product with a solid community. Nothing against it; it's a fine choice, and
there are probably people offering to host it as well. Either way you are
using Lucene. Using Lucene directly makes no sense unless you know what you
are doing and have a specific need to. If you have to ask, it's not for you.

If search is important to your use case and things like relevance, precision,
and recall have real impact on your business, you should get some specialists
involved and not reinvent the wheel or make all the rookie mistakes. Somebody
like me, basically ;-). However, that can be expensive, and if search is not
that critical, just sign up for one of the search-as-a-plugin-style products
out there and don't bother running a lot of infrastructure. E.g. Elastic
offers a thing called App Search; it probably covers most simple needs and is
stupidly easy to get started with. There are probably several competing
products that I can't vouch for.

You can always upgrade to something proper later. Things like Mongo and
Postgres also have some limited capabilities here, and you can even get away
with doing some simplistic stuff in SQL. However, there's a point where you
hit a brick wall: something you need is simply hard or impossible that way, or
you end up reinventing a lot of stuff that the likes of Elasticsearch probably
do better.

~~~
busterarm
> Elasticsearch, definitely. I always recommend using it in hosted form and
> not running your own cluster. That allows you to focus on getting data in
> and out of your cluster instead of sinking time into doing devops.

I think everyone doing Elasticsearch well has to bring it in-house eventually.
AWS's hosted solution is poor; Logz.io and Elastic Cloud are expensive.

There's a 7-figure/yr Elastic Cloud customer I work with who is so tired of
Elastic just randomly killing their clusters out of nowhere and having to
spend basically triple to deal with it that they're bringing it all in house.

~~~
ragle
> AWS's hosted solution is poor

What has been poor about it in your experience?

~~~
__blockcipher__
Not the GP, but you aren't allowed to touch settings like the shard recovery
rate.
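For context, on a self-managed cluster that knob is a single cluster-settings call, which is exactly the kind of thing AWS's service blocks. A hedged sketch: the setting key below is Elasticsearch's real recovery-throttle setting, but the value is purely illustrative.

```python
import json

# Body you would PUT to /_cluster/settings on a cluster you control.
# "indices.recovery.max_bytes_per_sec" throttles shard-recovery traffic;
# "200mb" here is an illustrative value, not a recommendation.
settings = {
    "persistent": {
        "indices.recovery.max_bytes_per_sec": "200mb"
    }
}

body = json.dumps(settings)
print(body)
```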

If a configuration change (changing the number of nodes, instance type, etc.)
goes wrong, your cluster gets stuck indefinitely in "Processing" due to a race
condition. The only way to get unstuck is to file a ticket. The company I'm at
doesn't pay for AWS support, so at one point we ended up completely tearing
down our cluster and rebuilding a new one (via Terraform) after getting tired
of waiting for the reps. (They advised us to cut off log flow to let the
system get out of Processing, which we did, but it didn't work, because once
it gets stuck like that it's just completely stuck.)

It's difficult to troubleshoot issues - you can get some logs via Cloudwatch,
but they're hard to search through and I'm not entirely positive everything
shows up there.

Amazon is always several releases behind Elasticsearch versions.

--

Elastic.co's offering looks much better, just by reading their excellent
comparison article: [https://www.elastic.co/blog/hosted-elasticsearch-services-roundup-elastic-cloud-and-amazon-elasticsearch-service](https://www.elastic.co/blog/hosted-elasticsearch-services-roundup-elastic-cloud-and-amazon-elasticsearch-service)

(We haven't used Elastic.co but what they say makes sense and I imagine their
service is much better)

--

Once you hit a big enough scale (for us, about 2 TB of logs a day, before
accounting for replication), it doesn't make sense to stay on Amazon's hosted
service.

I'm in the process of advocating for in-housing our Elasticsearch setup and
just building on top of EC2. Elasticsearch seems like the perfect candidate
for Kubernetes, since rebalancing is automatic and the affinity rules are
simple (every Elasticsearch instance needs its own node). Cluster autoscaling
(i.e. node-level, not _horizontal_) just makes too much sense.

Unfortunately I haven't gotten the go-ahead to take it in-house, but I've been
gunning for the project for some time now, so I'm hopeful I'll get the
opportunity.

--

BTW, totally unrelated, but for anyone managing Elasticsearch: make sure you
have your shard count tuned properly. When I came to this company, they had
their data way oversharded, with primary shards varying between hundreds of KB
and a few GB, i.e. orders of magnitude of difference. Switching to ~50 GB
shards (done by simplifying the way we were indexing) massively improved
performance.

Also i3 instances > anything with EBS.
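That rule of thumb is easy to turn into a back-of-the-envelope calculation (a hypothetical helper, not an Elasticsearch API):

```python
import math

def primary_shards_for(total_index_gb, target_shard_gb=50):
    """How many primary shards a ~50 GB-per-shard target implies."""
    return max(1, math.ceil(total_index_gb / target_shard_gb))

# e.g. a 2 TB daily log index would want about 40 primary shards,
# while a 500 MB index needs exactly one.
print(primary_shards_for(2000), primary_shards_for(0.5))
```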

[/ramble]

~~~
jillesvangurp
Check out the recently open-sourced Kubernetes scripts:
[https://www.elastic.co/elasticsearch-kubernetes](https://www.elastic.co/elasticsearch-kubernetes).
There's no need to reinvent all of that.

ES is indeed super solid if you know what you are doing. However, most users
getting started with this are probably going to find out a few things the hard
way, which is why I recommend hosted solutions: they remove quite a bit of
non-trivial devops from the equation.

------
arafalov
AFAIK, Reddit, Slack, Dice, Bloomberg, IBM, and Apple all use Solr. Jira and
Confluence use Lucene. Others use Elasticsearch and Fusion (a commercial
product on top of Solr). See, for example,
[https://www.activate-conf.com/more-events](https://www.activate-conf.com/more-events)
for presentations from several past years on who uses Lucene/Solr and how.

Also, the new trend in jobs is "Relevancy Engineering", which is less about
just setting up search engines and more about actually tuning them. That's
also where Machine Learning and other AI techniques come in (Learning to Rank,
Named Entity Recognition, sentiment analysis, etc.). That shift was recognized
by the rebranding of the conference from Lucene/Solr Revolution to Activate
last year.

See also Haystack conference which focuses very specifically on relevance
regardless of the specific search engine:
[https://haystackconf.com/](https://haystackconf.com/)

~~~
peterdm
Until about 2014-2015, many large companies wouldn't have looked twice at
Elasticsearch. The companies you list using Solr have been invested in search
for 10 years or longer (predating Elasticsearch), and may have high switching
costs.

~~~
arafalov
For sure they would have high switching costs. They would be switching from an
open-source (and free) product, to which they contributed changes, to a
commercial product. The license alone would be a serious discussion point. So,
that's a fact.

Was there an opinion in there as well that you were trying to convey?

------
bratao
From my experience, yes. Lucene is a production-grade, state-of-the-art
library, and Solr/Elasticsearch are heavily used in many scenarios.

This expertise is very much in demand.

My company migrated from Elasticsearch to
[https://vespa.ai/](https://vespa.ai/) and could not be happier: it's faster,
and a cluster is way easier to maintain. The "Application Packages" feature in
Vespa opened many opportunities to improve our product. (Curiously, we use
Lucene inside our custom application for a "search map" functionality,
something like this:
[https://www.lexisnexis.com/en-us/products/lexis-advance/search-term-maps.page](https://www.lexisnexis.com/en-us/products/lexis-advance/search-term-maps.page).)
I highly recommend it!

~~~
atombender
I've looked at Vespa a bit. It looks pretty good.

It's also readily apparent that it's an ancient system that's grown out of an
in-house project, and that its design has accrued a lot of oddities over the
years from lack of careful, co-ordinated design. It includes a bunch of
esoteric features (like the "predicate" function) that have obviously grown
out of Yahoo's own internal architecture.

One curiosity is Vespa's approach to schema and configuration changes. To make
any kind of change, or indeed set up an index, you have to create that
"application package" containing your schema and configuration in the form of
files, and then use separate REST APIs to "upload", "prepare" and "activate"
it. There's a CLI tool to help perform those steps, at least.
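For readers unfamiliar with Vespa, an application package is roughly a directory of configuration files deployed as one atomic unit via that upload/prepare/activate cycle. A sketch from memory (exact file and directory names have varied between Vespa versions):

```
app-package/
    services.xml      <- container and content cluster definitions
    hosts.xml         <- node list for self-hosted deployments
    schemas/
        doc.sd        <- the document schema ("search definition")
```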

It's nice that they're more consistent and rigid about schema and config
evolution than Elasticsearch. But it's not exactly operator-friendly, at least
not for first-time users with no pre-existing operations based around Vespa.

The package design also makes it more cumbersome to perform programmatic
updates to a schema. I once worked on a SaaS project where we indexed data in
Elasticsearch — arbitrary documents where we didn't know the schema ahead of
time, because we just accepted any JSON document posted by the client. With
ES, we could just use its dynamic mapping support, which automatically creates
field definitions when new fields arrive (using regex-based templates). Do you
know how long a package update takes in Vespa, to add, say, a single field?
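For comparison, the Elasticsearch dynamic-mapping mechanism mentioned above looks roughly like this (the template name is made up for illustration; `dynamic_templates` is the real ES mapping feature):

```python
import json

# A dynamic_templates entry: any incoming field whose detected JSON
# type is "string" gets mapped to an exact-match keyword field, with
# no manual schema update required per field.
mapping = {
    "mappings": {
        "dynamic_templates": [
            {
                "strings_as_keywords": {
                    "match_mapping_type": "string",
                    "mapping": {"type": "keyword"}
                }
            }
        ]
    }
}

print(json.dumps(mapping, indent=2))
```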

The Vespa documentation is also pretty terrible, in my opinion. They explain a
lot of things, but it's confusingly written, uses a lot of homegrown
terminology, and neglects to collect all the reference documentation in one
place. For example, you can't find an overview of the entire API — fragments
of it are just scattered across a dozen or so unrelated pages.

Lastly, Vespa is Java. One of the biggest challenges maintaining Elasticsearch
is controlling its resource consumption. You have to give it a lot of RAM, and
it's never clear how much it needs and what configuration settings and usage
patterns affect its memory use. Tuning it is something of a dark art. I don't
know exactly how Vespa is implemented (is it all pure Java?), but I'm worried
that, being a JVM app, it has the same shortcomings.

~~~
RealJon
Predicate fields are indeed an oddity, but not an architectural one - it's for
situations where the documents need to specify criteria (predicates) for when
they should match - like only match for certain users, certain times of day
etc. It's probably an underused feature imho since most people don't know this
can be done efficiently.

If you have dynamic fields like in your SaaS example I recommend using a
single map field rather than let data not under your control drive changes to
the set of fields.

> Do you know how long a package update takes in Vespa, to add, say, a single
> field?

A few seconds. However, rather than having operators do any of this manually,
set up an automatic process which deploys on each change made to the repo (i.e
do CD).

> all the reference documentation in one place

[https://docs.vespa.ai/documentation/api.html](https://docs.vespa.ai/documentation/api.html)

~~~
atombender
Another thing is that Vespa doesn't seem to support indexing of nested data,
either structs or arrays of structs. For example:

    
    
      {
        "location": {
          "city": "Washington",
          "state": "District of Columbia"
        },
        "friends": [
          {"firstName": "Bill", "lastName": "Clinton"}
        ]
      }
    

Maps aren't suitable here because they can't be used for ranking. So you have
to use structs, but those aren't indexable.

An application's search module could flatten the location key (e.g.
"location_city", "location_state") for simple attributes, but the same is not
possible for the array, since there can be arbitrary array elements. And you
can't split it to an array of strings:

    
    
      "friends_firstName_elems": ["Bill"],
      "friends_lastName_elems": ["Clinton"]
    

...because queries like "firstName contains 'Bill' and lastName contains
'Clinton'" could match different records ("Bill Bryson" and "George Clinton").
Never mind deeply nested arrays of objects containing arrays containing
objects containing arrays.
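The cross-matching problem is easy to demonstrate with a tiny sketch (data hypothetical): flattening an array of structs into parallel arrays breaks AND-semantics, because the per-field checks can match different array elements.

```python
# Flattened parallel arrays for a document whose friends are
# "Bill Bryson" and "George Clinton".
doc = {
    "friends_firstName": ["Bill", "George"],
    "friends_lastName": ["Bryson", "Clinton"],
}

# Naive per-field containment: matches, even though no single friend
# is named "Bill Clinton". This is the false positive described above.
naive_match = ("Bill" in doc["friends_firstName"]
               and "Clinton" in doc["friends_lastName"])

# Element-wise check over the original structs: correctly rejects.
friends = [{"firstName": "Bill", "lastName": "Bryson"},
           {"firstName": "George", "lastName": "Clinton"}]
correct_match = any(f["firstName"] == "Bill" and f["lastName"] == "Clinton"
                    for f in friends)

print(naive_match, correct_match)
```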

This seems unnecessarily restrictive. A search engine should be able to index
the data you already have, not force the application to contort its data to
whatever shape the engine requires.

Is there no way around this?

------
mlthoughts2018
I work professionally in this space and I can say over the past ~10 years, my
full time job at 4 companies has been almost entirely to migrate away from
SOLR / Lucene solutions and implement custom in-house search indexes.

Most of the time it has been for performance reasons. SOLR / Lucene have very
poor performance characteristics, especially when needing to support custom
sort ordering and heavy use of filters.

On other occasions it has been because you can’t easily extend search indexes
to more advanced use cases, like similarity-based reverse item search,
collaborative filtering, more advanced treatment of cold start issues / bias
towards existing popular content / trending search.

A lot of small or medium sized companies naively figure they’ll just use SOLR
etc to get something out of the box, or for side channel problems that are
smaller scale.

But you come to regret it pretty fast because you end up needing one
standardized way to build and deploy search indices and it has to support all
the bells and whistles that SOLR can’t _and_ be faster than SOLR, for the big
product use cases.

A lot of product companies are now beginning to use word-vector approaches
with nearest-neighbor libraries like Annoy as the first solution, rather than
as the solution you eventually migrate to once you realize SOLR does not
actually support your use case, not even as a means of getting up and running
quickly.

~~~
peterdm
Are you seeing a blueprint start to emerge for a _standardized way to build
and deploy search indices_ in the context of applications that need vector-
space features? (E.g. if you start with ANNOY, you get kNN but then how do you
add in the ability to refine, filter, rescore, sort with text, etc?)

~~~
mlthoughts2018
I actually have seen the opposite trend. You don’t want to standardize the
search engine and that has been a major problem with SOLR.

Instead, you want to custom build the data retrieval system so it’s tailored
to your use case.

One example from experience was needing to add hard filterable metadata to an
in-house search index. We solved this by actually calculating bit masks that
represented all the filtering criteria and having a frontend preprocessor that
would first restrict to the filtered subset and then do TFIDF-based relevance
sorting.

Creating the bit mask tooling ourselves (instead of relying on whatever baked-
in method of scanning items and filtering that comes with out of the box
search engine tools) allowed us complete control over the trade-offs,
particularly managing document deletions and optimizing run time performance
in certain ways that just weren’t available in out of the box tools, as well
as being able to integrate any in-house code into the search engine as needed
(since the whole system was in-house code).
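The bit-mask prefilter described above can be sketched in a few lines (all names hypothetical): one integer bitset per filter criterion, with bit i set when document i satisfies it. Filtering becomes a bitwise AND, and relevance ranking then only touches the surviving documents.

```python
docs = ["red shoe", "red hat", "blue shoe"]

def bitmask(predicate):
    """Build an integer bitset: bit i is set iff docs[i] matches."""
    mask = 0
    for i, d in enumerate(docs):
        if predicate(d):
            mask |= 1 << i
    return mask

is_red = bitmask(lambda d: "red" in d)    # bits for docs 0 and 1
is_shoe = bitmask(lambda d: "shoe" in d)  # bits for docs 0 and 2

# Filtering is a single AND; deletions would just clear a bit.
survivors = is_red & is_shoe
hits = [i for i in range(len(docs)) if survivors >> i & 1]
print(hits)  # only doc 0 ("red shoe") survives; rank these by TF-IDF next
```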

You want to create data models that are highly application specific, and then
route data into them. The mistaken approach of one-size-fits-all tools,
especially in information retrieval, is to pre-define the supported behavior
of the application, like a web service wrapping a search index, with baked-in
assumptions about the trade-offs and only limited support to modify or
configure the trade-offs under the hood.

The gravest mistake is thinking that, just because your use case seems to
function OK with those assumptions now, you can marry yourself to the
underlying data model. Then in the future you'll hit the point where you have
to throw it away and create something custom, but it will be far more costly
to do so and extremely hard to migrate gracefully and ensure integrations keep
working.

------
lm28469
AFAIK almost everything runs Lucene under the hood. It's 20 years old, and no
one is going to build something as good any time soon. I suppose some
companies like Google have their own in-house solutions, but otherwise it'll
always be something built on top of Lucene.

I guess you don't see much demand because for a lot of use cases the basic
setups are good enough.

~~~
tombert
About seven years ago, I got a contracting gig for a website that wanted a
"search engine". I remember thinking "Solr/Lucene is old, not pure-functional,
and therefore awful!" and decided to build my own. Somehow I even managed to
convince the client that this was a good idea.

I ended up trying to reinvent Solr for the client. After about two days of
trying to reinvent stemming and indexing, I realized this was stupid to do on
someone else's time, called the client to tell them I was moving to Solr, and
got the project done ahead of schedule as a result.

====

I think for 99% of use cases (involving search), Lucene/Solr/ES is perfectly
fine. However, I do absolutely hate that some companies have decided to make
it their primary database.

EDIT: I just want to make it clear, I think it's totally valid to try and
reinvent Solr for fun, or if that's something you're paid specifically to do;
nothing is perfect, and I am actually a big fan of the "if it works, break it
and make it better!" mentality.

~~~
eivarv
Agreed re: using ES as primary storage (which it is NOT meant as) - as far as
I can tell, it might even put you in breach of the GDPR [0].

TL;DR: Lucene < 7.5 won't merge segments larger than 5 GB (the default)
unless they accumulate 50% deletions.

I'm delivering a conference talk [1] about it later this year.

[0]:
[https://www.eivindarvesen.com/blog/2018/09/16/elasticsearch-and-gdpr](https://www.eivindarvesen.com/blog/2018/09/16/elasticsearch-and-gdpr)

[1]:
[https://2019.javazone.no/program/3f7cd8a7-a9ea-4874-a7dd-53166ada9f08](https://2019.javazone.no/program/3f7cd8a7-a9ea-4874-a7dd-53166ada9f08)

~~~
balfirevic
> Agreed re: using ES as primary storage (which it is NOT meant as) - as far
> as I can tell, it might even make you in breach of the GDPR [0].

How is GDPR compliance of having data in Elasticsearch influenced by it being
primary vs. secondary storage?

~~~
rickmode
See my reply to bryanrasmussen for a full explanation.

Basically: you reindex ES periodically, so when a user is deleted from the
primary, it will disappear from ES upon the next reindex. The old index is
deleted at the file system level.

~~~
ipython
At some point, though, the pedantry can get out of hand. After all, 'deleting'
at the file system level is just 'unlinking' the inode from the underlying
data blocks... in fact, data forensics at the file system level is probably
more well-understood than recovering deleted data from a Lucene shard.

At what point would you be able to 'delete' data without being in violation of
GDPR?

~~~
eivarv
I know - it really is a matter of definition.

Though the EU has said it will consider intention, etc., there's really no way
of knowing for certain unless and until it's settled in a court case.

------
matt_heimer
The technologies are still heavily used but there might be slightly less
demand for the skill set because:

1) Cloud search services - You are less likely to deal with setting up your
own instance and the concerns that go with it (sharding, etc.) because most
cloud providers offer either Elasticsearch as a service or some form of
turnkey deployment. You still have to do some low-level work, like score
manipulation, but you don't deal with as much administration.

2) Search as a Service -
[https://en.wikipedia.org/wiki/Search_as_a_service](https://en.wikipedia.org/wiki/Search_as_a_service).
There are several companies that provide a search SaaS offering. Typically
they provide value-adds beyond just running an ES service for you. Often they
provide web crawlers, so you can just point them at your domain, or they might
provide other data-source integrations, like pulling content from a database.
You get access to Solr/ES functionality if you want it, but you can get search
running without going to that level if desired.

Either way, a Lucene-based stack is still in use.

------
linsomniac
About a year ago I set up an ES cluster to load Apache logs into as an
experiment. Around a month later my boss asked if it was down. "Yeah, it looks
like it. Since you are using it, let me upgrade it from an experiment to
critical!" Since then we've been using it in more and more places, and we're
looking at whether it would be a good fit for storing some of our user data.
The big barrier right now is that we are running the version of SQL Server
from right before change data capture became part of the Standard edition, and
that's probably how we'd prefer to synchronize data into ES from our primary
database. We have a couple of home-grown solutions for synchronizing secondary
data sources, and we'd like to get out of that business. The SQL server is
always going to be our source of truth, probably.

~~~
parthdesai
>The SQL server is always going to be our source of truth, probably.

ES is not supposed to be your source of truth.

~~~
ps101
Can you elaborate on the reasons for this?

~~~
Izkata
Coming from experience with Solr, the answer is far simpler than the link in
the sibling comment indicates: data is processed and manipulated on import,
and it's difficult - sometimes impossible, depending on the field config - to
get that data back out in the format it was imported in.

Any change to the schema, such as switching fields between indexed/not
indexed/stored/not stored, requires reimporting the data to populate those
fields - data which you're not likely to have if it was your primary store.

~~~
manigandham
Elasticsearch has a _source field [1] that stores the entire original document
and is enabled by default. It's required to support features like highlighting
in results. ES also has a _reindex API that specifically makes use of this
[2].

1.
[https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html)

2.
[https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html)

------
be_erik
Search gets less attention these days, but mostly because the current tools
work so well. Scaling Elastic is still a bit of a dark art, but our company
likely wouldn't exist without lucene/elastic. They take a bit to learn and use
correctly, but they are incredibly powerful.

~~~
shhsshs
What do you mean by saying scaling Elastic is a dark art? Care to expand on
that?

~~~
meddlin
I'm by no means heavily experienced in this, but my current employer
(e-commerce) has a big demand for it.

First there's the Docker + Kubernetes architecture that ES lends itself to
really well. Then (depending on your use-case) there are concerns like
hot/warm architecture, node types, and ETL/indexing processes. ES recently
moved over to openJDK, so there's a couple intricacies there (i.e. JVM heap
size).

Then, there's document/query structure. In no particular order:

- Do you have any parent/child relationships?

- Do you have stop-word lists developed?

- Can search templates help your queries?

- How will you interface with ES? It has REST APIs, but it's recommended not
to expose ES directly to your applications.

- Some advanced querying possibilities, like customizing tokenizers and
normalizers, and a bit of internationalization.

- Oh, we haven't even discussed security yet.

- Also, ES isn't meant to be primary data storage. It's more of a "cache",
but not quite like Redis. So you'll need a DB elsewhere most of the time.

All of this changes depending on whether you're using it for SIEM, e-commerce,
AI/ML, etc. Also, Elastic now provides their own SIEM solution, a pre-built
search solution (App Search + Search UI), and built-in security features.
Check out the new ES 7.2 update; it's kinda nuts.

~~~
vosper
> ES recently moved over to openJDK, so there's a couple intricacies there
> (i.e. JVM heap size)

My current employer uses ES - we're on 6.8, planning to move to 7 in a few
months. Judging by the other replies here I'd say we have a reasonably large
cluster (150+ i3.2xlarge instances, billions of documents), so tuning the
cluster is very relevant to us. Could you expand on how things have changed
with the move to OpenJDK?

I've seen some claims online that, contrary to what Elastic recommends in
their docs, a few machines with huge heaps (100+ gb) is the way to go, rather
than many machines with 20gb heaps.

~~~
yehosef
> I've seen some claims online that, contrary to what Elastic recommends in
> their docs, a few machines with huge heaps (100+ gb) is the way to go,
> rather than many machines with 20gb heaps.

Usually the recommendation is less than 32 GB. This link has some more
discussion about it:
[https://discuss.elastic.co/t/es-lucene-32gb-heap-myth-or-fact/22920/9](https://discuss.elastic.co/t/es-lucene-32gb-heap-myth-or-fact/22920/9)

It seems whether it's better or worse depends on your data set, but I would
love to see tests of different kinds of workloads with larger or smaller heaps.
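The usual reason for the sub-32 GB recommendation is that the JVM can only use compressed ordinary object pointers below roughly 32 GB of heap, so a modestly larger heap can actually hold fewer objects. In jvm.options terms (values illustrative, not a recommendation):

```
# jvm.options fragment: keep min and max heap equal, and stay a bit
# under 32g so compressed object pointers remain enabled.
-Xms30g
-Xmx30g
```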

------
m-i-l
Within my current organisation, the two main search options are Solr and
Elasticsearch, both based on Lucene.

Generalising a bit, Solr is more targeted at enterprise search and
unstructured content search (e.g. it's bundled with most content management
systems), and Elasticsearch is more targeted at data analytics and structured
data search (e.g. with the ELK stack). Again simplifying a bit, Solr can be
more configurable and/or work better with the types of data that benefit from
more configuration, while Elasticsearch can work better with the types of data
that work more "out of the box".

I'd agree that there don't seem to be a huge number of openings for specialist
search roles, or a huge number of people specialising in search, but search is
often part of another role, and there are often people who have touched on it
in their work. That suggests that many people are just using it with largely
default setups. Having said that, something like advanced relevancy tuning, if
you need it, is a very much underappreciated skillset, and it definitely needs
someone with good experience or the ability to learn.

------
garysieling
I'm using Solr for
[https://www.findlectures.com](https://www.findlectures.com), but I think
Vespa looks interesting - it lets you store feature vectors in the index, so
you can do neat things to incorporate ML algorithms in ranking.

~~~
fogetti
That's right! Vespa looks cool. I really appreciate that they even implemented
a use case as a proof of concept.

------
simplechris
We do at Vimeo: I gave a talk this week at the NY Elasticsearch meetup about a
new search product we built using ES. If you're interested you can access the
recording of the livestream here:
[https://vimeo.com/348443979](https://vimeo.com/348443979) The product in
question is [https://vimeo.com/stock](https://vimeo.com/stock)

------
softwaredoug
Yes but the search market has shifted.

We're at a point where Lucene and family are used for increasingly
sophisticated use cases. The commodity end of the market used to be dominated
by open source (Solr connected to Drupal, for example).

Now, for commodity sites, there are so many SaaS search products that it
doesn't make as much sense to hook up Solr or ES to make your blog or
university website or whatever searchable. A lot of basic search use cases are
covered by products where you don't have to hire a team to manage search.

But at the higher end, for apps with search, customization, especially
domain-specific relevance at scale, is often a product differentiator (though
often not _so_ important or weird that you should write your own engine). So
this is where these systems thrive...

~~~
ageyfman
100% agreed. This is what my company uses ES for, and it's exceptional at it.

------
nova22033
SOLR is great, but it's a pain to manage in the cloud. If you lose an EC2
instance, there is manual work involved when you bring up a new one. You have
to tell the new instance which shards it's going to replicate. If the EC2
instance hosting shard1 replica2 goes down, you can't just bring up a new
instance and have it be replica2. You need to use the API (which is just a
call to a bunch of URLs) to get the new instance to be part of shard1. Also, a
good cloud overview UI would be nice. 8.1.1 does have some improvements.
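For reference, the call in question is the SolrCloud Collections API's ADDREPLICA action. A sketch of the URL (host, collection, and node names hypothetical):

```python
from urllib.parse import urlencode

# Attach a freshly launched instance to shard1 of a collection as a
# new replica. "node" is the Solr node name of the new instance.
params = urlencode({
    "action": "ADDREPLICA",
    "collection": "products",
    "shard": "shard1",
    "node": "new-host:8983_solr",
})
url = "http://solr-host:8983/solr/admin/collections?" + params
print(url)
```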

Also, SOLR speed is almost directly proportional to disk speed. If your index
is on solid-state drives with high IOPS, you'll be fine.

Backing up a large index is a little painful too.

------
kostarelo
We have been using [https://www.algolia.com/](https://www.algolia.com/)
completely as a replacement for ES.

Pros:

- Managed search engine

- Great API / developer experience

Cons:

- Cloud-only, which makes local development hard

- Expensive (I guess it depends on the usage)

~~~
np_tedious
Worth noting this is what Hacker News itself uses.

~~~
kayoone
You mean [https://hn.algolia.com/](https://hn.algolia.com/)? That's not made
by the HN team, though. I wouldn't be surprised if the Algolia guys built it
to show off their tech, which is a very smart idea anyway.

------
acdha
I use Solr & ElasticSearch heavily — they're “boring” in the sense that they
do a lot of heavy lifting without many surprises and they scale easily into at
least terabyte-sized indexes.

One area where this might be less true is that the full-text search in
Postgres & MySQL has matured to the point where some basic applications might
reasonably decide that it's not worth using a separate service.

------
manigandham
Elasticsearch is very popular because it works well for generic searching and
can be customized for lots of unique scenarios. There's competition on the
infrastructure side using something other than Java/JVM though:

For Rust, there's Toshi: [https://github.com/toshi-
search/Toshi](https://github.com/toshi-search/Toshi) which is built on top of
Tantivy: [https://github.com/tantivy-
search/tantivy](https://github.com/tantivy-search/tantivy)

For C++, there's Xapiland:
[https://github.com/Kronuz/Xapiand](https://github.com/Kronuz/Xapiand)

For Go, there's Blast:
[https://github.com/mosuka/blast](https://github.com/mosuka/blast) built on
Bleve:
[https://github.com/blevesearch/bleve](https://github.com/blevesearch/bleve)

------
Gonzih
Yes. They might not look trendy anymore, but they are still heavily used in
the industry. I constantly see use cases where Solr or ES would be a much
better choice, but those options are simply ignored because they are rarely
visible at the top of tech publications.

------
blyry
We use elasticsearch to power our ecommerce search, and it works pretty well,
but we're considering moving to a commercial product, or solr, to get closer
to personalized results based on our knowledge about the user.

We just rewrote our internal search API from a windows service indexer with
lucene indices and a vb.net SOAP api in iis to a netcore service, hosted in
k8s, that splits out ingest, analysis, storage and queries into separate
domains, with the writes going to Azure Search Service*.

Our use case might be a bit weird -- this app is essentially an internal API
that supports the search needs of our other teams and their own products for
our own internal software. It probably has 30 million records across a few
different indices. We made the decision to migrate from lucene because of the
ease of clustering elasticsearch. We previously achieved availability by just
running multiple copies of the standalone service and doing smart health
checks at the load balancer level in case a lucene index got corrupted and
needed to spend a day rebuilding, but that didn't scale well for rebuild
times, and we have been consolidating all of our legacy tech onto netcore and
kubernetes.

Raw lucene was an order of magnitude faster than azure search service, but
that's probably more a function of being able to essentially query the indices
directly in memory of the webservice, as opposed to a slightly
underprovisioned search cluster with all the HTTP overhead. We're migrating it
to our own elasticsearch cluster right now for performance, cost savings and
cloud-agnosticism.

~~~
mish15
We have an early access product for personalized ecommerce search @Sajari if
you are interested. One early access company is on track to generate $30
million in additional revenue from switching (over 10% search conversion
increase). That is across millions of skus and hundreds of products updated
per second also.

We are also looking at releasing this as a k8s deployable product. It's all
k8s services and gRPC already...

------
pmarreck
Two years ago I decided to go with Postgres' built-in fulltext search instead
of adding another dependency like ElasticSearch, and I believe I've profited
from that in much less maintenance while still getting quite good
performance/features.

~~~
aldoushuxley001
Any tips for scaling Postgres-only fulltext search?

~~~
ngrilly
[https://github.com/postgrespro/rum](https://github.com/postgrespro/rum)

~~~
pmarreck
Good find, bookmarked!

------
flexer2
We use ES heavily. Most of our queries are basic document filtering plus some
geospatial stuff. We could probably have done it with Postgres/PostGIS, but
with AWS managed ES, it's all "good enough" -- we can do geospatial searches
on millions of documents with response times around 100ms. The other part I
like about ES is how it's easy to scale out across machines, which lets us
handle quite a bit of load and tolerate failures easily. We have a cluster of
5 m4.large instances and it only runs us about $600/mo. Like others have said,
tuning AWS ES sucks, but it's always been good enough for us.

We've run into some pain points like trying to index very large shapes into a
geospatial index, but have workarounds for basically everything now. We also
had a problem where when AWS had the outage around autoscaling groups a few
months ago, we lost 3/5 of our instances and had to reindex some data from
backups. That was the worst thing that's happened.

I'm sure there would be better/faster/cheaper ways of doing what we do, but
for what we get out of the box for the price, it's going to take a lot for us
to move away from it for now.

------
niklasrde
Yep, most of the audience facing search functionality across our sites
([https://www.bbc.co.uk](https://www.bbc.co.uk)) is powered by various Solr
clusters, hosted on-prem and cloud.

------
quickthrower2
ES powers search on one of my side projects,
[https://dealscombined.com.au](https://dealscombined.com.au).

The ability to not only full text search but to do it fast, to tune the
lexical behaviour (lowercase, plurals, stemming etc.) and, to top it all off,
to combine geo search pretty much left any other solution in the dust. I even
considered some paid solutions.

I also considered postgres, which looked strong, but I felt it'd be harder to
set up these features and that the full text would be weaker. Geo might be
stronger, but my geo needs are simple.

ES was easy to set up to do this, taking about 2 hours of tuning. I used AWS
so I didn’t have to figure out how to install it. I admit I had a mental model
of ES from ELK-ing at work.

At some point when the site gets more traffic I'll tune the search so that
rather than nearest matching, I'll score both the distance and the words and
order by perceived relevance, i.e. weigh up how close something is with how
well the words match.
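
Sketching the kind of blend I have in mind -- a weighted combination of a
text score and a distance score that decays with kilometres. The weights and
decay shape here are made-up placeholders, not ES's function_score defaults:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def blended_score(text_score, distance_km, w_text=0.7, w_geo=0.3,
                  scale_km=10.0):
    """Weigh up how well the words match against how close the deal is."""
    geo_score = 1.0 / (1.0 + distance_km / scale_km)  # decays with distance
    return w_text * text_score + w_geo * geo_score

d = haversine_km(-33.87, 151.21, -33.92, 151.25)  # a few km across Sydney
print(blended_score(text_score=0.8, distance_km=d))
```

In ES itself this would likely end up as a function_score query, but the
arithmetic is the same.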

ES is a pretty amazing tech and it’s the easiest way to set up a decent
quality free text search for your site.

------
holstvoogd
yeah, there aren't many alternatives unfortunately. I've used Sphinx a lot,
but am now stuck with ES and it is horrible to operate, probably because we
don't need a cluster solution so it is total overkill. Yeeeah for technical
debt.

For small projects (some 1000s of documents), I'd probably go with Postgresql
FTS if possible, and Sphinx/Solr for anything with indices smaller than a
couple 100GBs. After that, ES seems reasonable & worth the overhead.

EDIT: my biggest issue with ES is that it seems to be specifically engineered
to sell you support. So get a managed version if you can.

~~~
riku_iki
> I'd probably go with Postgresql FTS if possible. Sphinx/Solr for anything
> with indices smaller than a couple 100GBs After that, ES seems reasonable &
> worth the overhead

why do you think you can't throw 1TB of data on postgresql?

~~~
aldoushuxley001
Could postgres FTS handle millions of documents within a reasonable timeframe?

~~~
manigandham
Yes, it's searching against the ts_vector data types which can be indexed.

The problem with PG FTS is that it doesn't have advanced search
functionalities (fuzzy matching, faceting, term distance, highlighting
results) and it lacks the modern relevance scoring systems so that'll be the
limiting factor instead of speed.
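
For concreteness, the indexed-tsvector setup that refers to looks roughly
like this (table/column names are hypothetical; the generated-column syntax
assumes PostgreSQL 12+, older versions would use a trigger instead). The SQL
is held in strings here just to show the shape:

```python
# The shape of an indexed Postgres FTS setup. Table name "docs" and
# column "body" are hypothetical.
ddl = """
ALTER TABLE docs ADD COLUMN tsv tsvector
  GENERATED ALWAYS AS (to_tsvector('english', body)) STORED;
CREATE INDEX docs_tsv_idx ON docs USING GIN (tsv);
"""

# Ranked query: @@ matches against the indexed tsvector, ts_rank scores it.
query = """
SELECT id, ts_rank(tsv, q) AS rank
FROM docs, to_tsquery('english', 'search & engine') AS q
WHERE tsv @@ q
ORDER BY rank DESC
LIMIT 20;
"""

print(ddl.strip())
print(query.strip())
```

The GIN index is what keeps this fast over millions of rows; what's missing
relative to ES is everything around it (facets, highlighting, fuzziness,
modern relevance models).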

------
otisg
At Sematext we help companies with Apache Solr and Elasticsearch. ES/ELK is
definitely used more for timeseries sorts of data. The Solr community puts more
focus on full-text search (email search, product search, database search,
etc.). Elasticsearch can do that, too, and we regularly help companies who use
ES for that, but Solr seems more focused on that use case.

------
inertiatic
I work at a company that's a major player in online academic publishing.

We use Solr to power our main, end user-facing search after migrating from a
custom Lucene solution some years ago.

To me it seems Lucene based tools are the best for the job if the main thing
you care for is having text focused search with a huge potential for
extensibility.

But there are a lot of use cases where you will never need anything more than
the base capabilities of this technology (so you can be served by something
simpler to use or maintain nowadays) and there are probably a lot of use cases
where your search will be mainly driven by vector similarity (in which case
you are working around the limitations of picking a technology with another
focus).

As far as jobs go, I'm not sure how in demand specialists are. After a few
years of working in the field I had a look to see if I could leverage my
experience to get a remote position and came up with pretty much nothing.

------
james-skemp
The Sitecore CMS moved from Lucene to Solr for search on on-prem instances.
After trying to run it on Windows we were happy there was a third party
provider that was easy to work with.

~~~
alasano
For Sitecore I will shamelessly plug Coveo for Sitecore:
[https://www.coveo.com/en/solutions/coveo-for-
sitecore](https://www.coveo.com/en/solutions/coveo-for-sitecore)

I think that we definitely have the best integrated and most full featured
solution for Sitecore customers.

It's not only about querying but also having a UI framework, built-in
customizable indexing, analytics tracking and access to machine learning
available in one package.

Source: Am product manager for Coveo for Sitecore.

~~~
james-skemp
Based upon community responses (and someone from the organization being in the
Slack group and responding as well), we ended up going with SearchStax.

I work at a state university (with the associated purchasing ... issues :))
and we had some staffing issues, which they were very accommodating of from
initial quote to subscription.

If we end up running into issues as/if we expand our usage (we're essentially
only using Solr for the mandatory bits) I'll keep Coveo in mind. :)

------
Snoddas
We use solr/lucene. It provides search and indexing for our CMS. It will in
all likelihood be retired when we change CMS; that's scheduled for autumn next
year (yeah right).

------
acd
Duckduckgo was/is using Solr

[http://highscalability.com/blog/2013/1/28/duckduckgo-
archite...](http://highscalability.com/blog/2013/1/28/duckduckgo-
architecture-1-million-deep-searches-a-day-and-gr.html)

------
thecatspaw
Yes, we use it as part of an ecommerce framework to index products, categories
and cms content for various customers. However we don't have a specific Solr
position, as we only need to make small adaptations which an average developer
can usually do/infer from existing code.

------
crishoj
ES is backing the product search at ecommerce site
[https://www.imusic.dk/](https://www.imusic.dk/). Even with 16M documents, a
fairly intricate ranking function, and spelling suggestions, latency is in the
order of 200 ms.

------
paulirwin
Lucene is still great today for smaller indexes that can entirely fit in
memory and can be indexed quickly on app startup. Think something like
searching for a setting in Windows 10 settings, or if you had some other
fixed, small data set that you wanted to allow users to do real text search
without the complexity of a search service. Lucene is still helpful here
because of the analyzers, stemming, etc.

But for searching data that can grow and change over time, it's hard to
justify using Lucene directly anymore. Azure Search (I believe built on
Lucene) is an awesome (but relatively expensive) SaaS solution that is far
easier to manage than Elasticsearch.

------
srameshc
Search built using Postgres is underrated. It can do a lot if used properly.

~~~
joking
They are on different levels: a search on a SQL database will have real-time
results, while solr / elasticsearch will have a delay (from milliseconds to
minutes). That delay gives them the advantage of building a series of data
structures much more suited for search than the ones in a database.

I built several search systems for classified listing sites; something like
solr is a life saver once you get enough traffic. It's much easier to scale
than a SQL database, and you can do many more things. The easiest example is a
facet: say you make a search on a car listing site and you want to show how
many cars from each brand you are matching. With SQL you have to make another
query, while in solr you get those counts in the same query. Now add the
model, the color, the gas type, the transmission, the place, etc., and that
quickly grows into something unscalable, while with solr you can do it easily.
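
In Elasticsearch (and similarly in Solr) those facet counts are terms
aggregations riding along on the search request. A sketch of one such request
body -- the index field names ("brand", "color", etc.) are hypothetical:

```python
import json

# One ES request body returning matching hits plus facet counts for
# several attributes at once. With SQL, each facet would be an extra
# GROUP BY query. Field names are hypothetical.
body = {
    "query": {"match": {"description": "diesel wagon"}},
    "size": 10,
    "aggs": {
        "by_brand": {"terms": {"field": "brand"}},
        "by_color": {"terms": {"field": "color"}},
        "by_transmission": {"terms": {"field": "transmission"}},
    },
}
print(json.dumps(body, indent=2))
```

Every bucket list comes back in the same response as the hits, so adding a
sixth or seventh facet is one more line, not one more query.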

~~~
glintik
ES search has no delay if the data is actually committed to the index. The
same is true for any general DB - Postgres etc.

------
scalesolved
We use Solr heavily at
[https://www.helpscout.com/](https://www.helpscout.com/), it really powers a
ton of our functionality.

------
adventured
I'm still using an older version of Sphinx. I love it. It's fast, moderately
flexible, very lightweight, easy to set up and produces good enough results. I
have also found it to be highly reliable (at least the version I've been using
for years). It's not useful for anything that needs hyper scale (Twitter et
al), however for the next tier of scale below that it generally does well if
you know how to leverage its strengths.

------
rchasman
We use ElasticSearch at Lawmatics, and it powers more functionality than just
our search! We use it to power our Automation targeting engine, reporting
features, audience builder and pagination, filtering, sorting of data tables.

We denormalize associated records into one Index. And any record that we need
to find based on user-defined queries will go through ES since it's much
simpler to metaprogram queries across denormalized data (no conditional
joins).

~~~
yehosef
This is one of the unsung benefits of ES in my eyes. Relevancy requires
tuning of the indices and queries, and in most cases (that non-IR-experts
would program) ES will give as good results and be as easy or easier to
implement as Solr.

But after you've gotten over that, you realize that this new tool can do lots
more things than just text search. Time series metrics, BI, predictive ML,
APM, etc. with relatively little work. With Solr, you could do those non-IR
tasks, but it's going to feel much more awkward, IMO.

------
jiripospisil
We use ElasticSearch extensively at our company but we don't use it for full
text search (in fact, we don't use its full text capabilities at all) but
rather for its ability to match and aggregate large data sets without having
to create any indexes at all prior (and it's fast, it still blows my mind a
little). This allows us to offer customers a way for them to create arbitrary
queries in our own little DSL.

------
tjpnz
My experience is mostly Japan-centric nowadays but SOLR is very widely used
here and there is demand for people with that background. A lot of work has
been done with SOLR to better support the intricacies of dealing with Japanese
text which differs substantially from other languages. Most of the search and
NLP jobs I've seen recently outside of Google and Amazon expect some SOLR
experience.

------
porker
This has been a great thread, and there are some heavyweight indexes here.
But what about at the other end of the scale?

Say when you've got 10k-50k contact details (name, email, phone) and you want
to provide a quick, autocomplete lookup. I've used basic SQL string matching
for this, but it doesn't catch mis-spellings and the rest.

Running SOLR or ES is overkill for this. Is there a tool that fits this niche?
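
For scale, a zero-dependency sketch of the kind of lookup I mean (names are
made up; stdlib difflib handles the simple misspellings):

```python
import difflib

contacts = ["Alice Johnson", "Bob Smith", "Carol Schmidt", "Dave Jonson"]

def autocomplete(prefix, names, limit=5):
    """Prefix matches first, then close fuzzy matches for misspellings."""
    p = prefix.lower()
    exact = [n for n in names if n.lower().startswith(p)]
    fuzzy = difflib.get_close_matches(prefix, names, n=limit, cutoff=0.5)
    seen, out = set(), []
    for n in exact + fuzzy:
        if n not in seen:
            seen.add(n)
            out.append(n)
    return out[:limit]

print(autocomplete("Jonson", contacts))  # catches "Johnson" too
```

At 10k-50k records the whole set fits in memory, so even a naive scan like
this stays fast; the open question is whether anything packages this up more
nicely than rolling it yourself.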

~~~
rpedela
What's wrong with running Solr/ES? It is trivial to run either in standalone
mode, and it is a lot easier to set up autocomplete with misspelling support
than messing with PG. Algolia is a good option if you have the budget.

~~~
porker
> What's wrong with running Solr/ES?

With this small quantity of data, usually the app's running on a small VM. I'm
wary of running anything Java, having had it require large amounts of RAM
before.

That said, I haven't touched JVM stuff for 5+ years.

------
sciurus
While I was at Eventbrite we were using Solr and started moving to
Elasticsearch. I know one of the main people I worked with on that recently
left for Github, which also uses Elasticsearch.

At Mozilla I work one project with a search component ([https://crash-
stats.mozilla.org/](https://crash-stats.mozilla.org/)), and it uses
Elasticsearch.

~~~
JnBrymn
:handwave:

------
sidcool
ElasticSearch is widely used in enterprises for full text searching.

------
kodz4
Wikimedia uses ES and you can download their entire index for any of their
sites wikipedia/travel/quotes etc.

------
truth_seeker
Yes. Also, just recently MongoDB v4.2 added Lucene as an embedded engine for
text search capabilities

~~~
yla92
Do you know if it is coming to the community edition or is it only for Atlas?

~~~
truth_seeker
Atlas.

Source : [https://www.youtube.com/watch?v=4QUGWnz-
XaA](https://www.youtube.com/watch?v=4QUGWnz-XaA)

------
Jeremy1026
We use ES for searching data within our application. We store about
20,000,000 rows of data in the primary table, with plenty of dependent and
secondary tables. ES takes the load off our MySQL cluster for generating
reports and fulfilling searches.

------
darkhorn
The company I'm working at uses Solr. We collect data either by paying for
it, getting it for free, or using crawlers. Then we query the data from Solr
if it is not metadata. And there are other types of databases that we use too.

------
csixty4
ES is somewhat popular in the enterprise WordPress space, driven largely by
10up's ElasticPress plugin which uses it both for search and to improve the
performance of database queries over MySQL.

------
ketanhwr
Atilika ([https://www.atilika.com/en/](https://www.atilika.com/en/)) uses
Lucene/Solr for their NLP based search products.

------
samsk
I'm using SOLR for public facing search API/engine on several projects (own
and customers). ES is imho better for doing various on-demand analytics (like
dev logs search)

------
airocker
ES is used in code search tool dxr which we locally use as well:
[http://dxr.mozilla.org](http://dxr.mozilla.org)

------
termez442
Data discovery: this is the key concept. They are absolutely unbeatable at
this, and because of it you can use them for a lot of things. [http://siren.io](http://siren.io)

------
janemanos
I guess more and more people are turning towards search engines directly
integrated into databases, like ArangoSearch of ArangoDB to combine search
with other needs [https://www.arangodb.com/why-arangodb/full-text-search-
engin...](https://www.arangodb.com/why-arangodb/full-text-search-engine-
arangosearch/)

~~~
mosen
I assume you work for ArangoDB?

~~~
mixedCase
Looking at his/her post history, he/she does but rarely discloses it.

------
Risse
Yeah, I work pretty much daily with Solr and related stuff (PHP, JSON). Still
used quite a lot in PHP and Drupal scene.

~~~
Kimitri
Yup. Lots of Drupal sites use Solr. There are very good contributed modules
that make using Solr with Drupal a doddle.

~~~
MaddAgent
Same here - using SOLR with Drupal and it's pretty simple and effective

------
lightbyte
Yes, I work on the search team at my work and we extensively use Elasticsearch
for many different search services.

------
marcusae313
Using SOLR heavily. I have a strong preference for SOLR when it comes to Full
Text and ELK when it comes to logs.

------
znpy
I've seen elastic search used as a document indexing engine and thus also as
search engine.

Also Solr, though a lot less.

------
altendo
Are people still using Sphinx Search
([http://sphinxsearch.com](http://sphinxsearch.com)) at all? It doesn't seem
like it gets many releases anymore...since they unpublished the source code,
it's hard to see how much activity there is.

~~~
pQd
[https://manticoresearch.com/](https://manticoresearch.com/) is the lively,
open source fork of Sphinxsearch. That's where some of the earlier developers
from the project moved to. It's used as a text-search backend on craigslist.

~~~
jzawodn
Definitely. I love me some manticore. :-)

------
dennisgorelik
We use ElasticSearch version 1.7 at postjobfree.com

The reason to use version 1.7 -- is faster percolation (reverse search).

------
redwood
MongoDB just launched beta support for lucene index based search native to its
MongoDB Atlas platform.

------
univalent
Still using Lucene here searching through short text documents in a Java based
server product.

------
dlphn___xyz
ES has far more uses than devops - the most common use case I've seen is
tagged document searching

------
larrik
Yes, quite a few of my clients use ElasticSearch extensively outside of the
ELK stack.

------
withinrafael
Data point: We're actively working on migrating from Autonomy to
Elasticsearch.

------
jinglebells
Nobody seems to have mentioned that GitHub uses ElasticSearch.

[https://www.elastic.co/use-cases/github](https://www.elastic.co/use-
cases/github)

------
v8engine
Does anyone have any production experience for Yahoo Vespa? I've heard it
competes with ES in opensource search.

[https://vespa.ai/](https://vespa.ai/)

------
dynamite-ready
Xapian. C++ based. I really like the API.

~~~
leetrout
I came looking for a Xapian comment. It is HIGHLY underrated.

I love the "just works" functionality and portability.

At a previous job we had two existing options for search, Postgres GIN or the
heavyweight ES cluster with Kafka. When I recommended grabbing Xapian for
simple indexing (2-4k records; tossing the index whenever we needed an update
was OK) no one would bite.

------
krossitalk
Making use of the ELK stack currently

------
Moto7451
A lot of ElasticSearch is in use at my work for search feature work.

------
nkristoffersen
We are moving from FAST to ES. Major pain.

------
marcusae313
SOLR for the full-text win.

------
vast
Lucene is the carry.

