Ask HN: Are Lucene/Solr/ES Still Used for Search?
275 points by lovelearning on July 19, 2019 | 219 comments
I casually visit jobs/freelancing sites once in a while. I don't see as much demand for Lucene/Solr/ES skills for website / text / document search or other kinds of information retrieval, as I used to about 4-5 years ago.

ES seems to be the most popular but only in its ELK avatar for devops dashboards.

What technologies are you people using for text or document or website search nowadays?

Very much so. For retail/catalog search SOLR dominates. There's a lot more customization available for relevancy/ranking OOB than Elastic. Drawbacks are managing indexing - SOLR cloud is much harder to manage.

For commodity search workloads (general retrieval/faceting) Elastic does a fine job. It scales well and there is good documentation and support.

Lucene is the core engine behind both of these solutions.

For fun, let's look at the large Enterprise acquisitions over the years:

* Verity - bought by Autonomy

* Fast - bought by Microsoft (Also known as the Enron of Norway...)

* Autonomy - bought by HP (Look at the backstory on this deal!)

* Endeca - bought by Oracle

* Vivisimo - bought by Oracle

* Google - GSA (now Google Cloud Search, hosted solution)

Next, follow the path of online acquisitions:

* IndexTank - bought by LinkedIn

* Swiftype - bought by Elastic

There are a number of interesting independent players still. Coveo plays in the Enterprise space, but it's a hard market. Algolia is doing great in the commodity online search space and seems to be growing well.

This is an area I think is open to more competition. Especially with AI/ML technologies available around Document Understanding - the Enterprise market is open for a good on-prem upstart to really take off.

Ping me offline if you have additional questions - spent almost 20 years in the space and ran a search company of my own.

> * Fast - bought by Microsoft (Also known as the Enron of Norway...)

That one was painful to live through, we got forced to migrate to Windows and everything went sideways. That was almost 10 years ago with quite a big cluster (tens of nodes).

You were never forced to migrate to Windows. In fact, the last major customers on ESP were using Linux to the end.

Don't forget Powerset!

>"For retail/catalog search SOLR dominates."

Interesting, could you elaborate on why SOLR is dominant in that space over say Elasticsearch?

Definitely. Product catalog information as a whole doesn't change often. Price and availability do. With retail catalogs you often do a full reindex of the data in your master catalog and then run partials to account for price/availability if you don't do that in realtime using filters. Since the system of record is not the search index, Elastic is often not a good solution here.

Also, relevancy in retail is often influenced by other factors that cannot easily be implemented in Elastic. TFIDF/BM25 search is available in both platforms, but you may also weigh in other factors such as relationships with the vendor, stock on hand, or other ML techniques that are more easily implemented in SOLR.

One more point - if you can run the entire index on one machine, deploying and managing SOLR is much easier than Elastic. The complexity only grows when you have a distributed system. You can fit a lot on a big box.

Yes, it's certainly easy to manage on one system. As the production system I work on is an academic one with few concurrent users it's possible to get away with that.

Thanks for the explanation and insight. I hadn't ever considered how different the catalog use case is from my own use cases. Makes good sense. Cheers.

I think a big reason is simply that Solr has been around a few years longer and most older sites would have been using that by the time Elasticsearch came along and never saw a good reason to go through a painful process of replacing one with the other.

These days, I would say there's very little that either product does that the other product can't do though obviously there are lots of strengths and weaknesses on both sides.

There is a long term trend of search engine companies being acquired, going away, products ending, and customers left in the lurch....

Vivisimo was acquired by IBM.

Source: I have the capital V from the building in my house.

(I see now atambo has already noted this typo)

We have on prem Coveo, which is based on ES. It is/they are horrible. We pay a large amount in Enterprise support and maintenance, and it is next to nonexistent.

Thankfully we are moving to Azure Cloud Search services, which just work, over the next year.

Good bye manure pile, hello compost.

There are no contacts in your profile (for pinging offline)


Please update your profile with your contact information.

thanks for catching. updated.

Vivisimo was actually bought by IBM, not Oracle.

Elasticsearch, definitely. I always recommend using it in hosted form and not running your own cluster. That allows you to focus on getting data in and out of your cluster instead of sinking time into doing devops.

While I have not used Solr recently, it has evolved along with ES as a solid product with a solid community. Nothing against it; it's a solid choice and there are probably people offering to host that as well. Either way you are using Lucene. Using Lucene directly makes no sense unless you know what you are doing and have a need to do that. If you have to ask that means it's not for you.

If search is important to your use case and things like relevance, precision, and recall have real impact on your business, you should get some specialists involved and not reinvent the wheel or make all the rookie mistakes. Somebody like me basically ;-). However, that can be expensive and if search is not that critical, just sign up for one of the search-as-a-plugin type products out there and don't bother running a lot of infrastructure. E.g. Elastic offers a thing called App Search, which probably covers most simple needs and is stupidly easy to get started with. There are probably several competing products that I can't vouch for.

You can always upgrade to something proper later. Things like mongo and postgres also have some limited capabilities here and you can get away with doing some simplistic stuff with sql even. However, there's a point where you hit a brick wall and there's something you need that is simply hard/impossible using that or where you end up reinventing a lot of stuff that things like Elasticsearch probably do better.

> Elasticsearch, definitely. I always recommend using it in hosted form and not running your own cluster. That allows you to focus on getting data in and out of your cluster instead of sinking time into doing devops.

I think everyone doing Elasticsearch well has to bring it in-house eventually. AWS's hosted solution is poor, Logz.io and ElasticCloud are expensive.

There's a 7-figure/yr Elastic Cloud customer I work with who is so tired of Elastic just randomly killing their clusters out of nowhere and having to spend basically triple to deal with it that they're bringing it all in house.

Elastic Cloud is pretty reasonable IMHO. We have a logging cluster that costs us around 250 euro per month. Yes, running it ourselves would be cheaper but doing that in Amazon would eliminate most of the cost benefits. If you then consider the devops cost needed to do that, there is no difference. Most companies getting started with Elasticsearch should probably not do that on day 1. You can always start doing this later if there's some reason to.

I've never had clusters being killed randomly but I did have a few self inflicted issues with cluster instability due to flooding the cluster with too much data and not having suitable data retention policies in place. With the recently added index life cycle management (an x-pack feature), this is easier to manage these days.

If you are spending seven figures a year on Elasticsearch, you clearly are not a beginning user and there are some cost savings that you might be able to realize by taking ownership of the problem of hosting it somewhere cheaper. For that, their recently open sourced kubernetes helm charts are worth a look. Those scripts take care of a lot of things and get you a self hosted version of Elastic Cloud.

Amazon hosted clusters are indeed a bit bare-bones (as is their support for these clusters) and I would also not recommend them; you get more value for money by using Elastic Cloud.

> so tired of Elastic just randomly killing their clusters out of nowhere and having to spend basically triple to deal with it that they're bringing it all in house.

That is because Elastic Cloud is not a fully managed Elasticsearch. People often don't get that with Elastic Cloud you are still responsible for your ES cluster. That's one of the differentiators that e.g. Sematext has (disclaimer: founder).

> AWS's hosted solution is poor, Logz.io and ElasticCloud are expensive.

Right. Have a look at https://sematext.com/logsene pricing. People say Sematext compares favorably to Logz.io and Elastic Cloud.

Why are all ELK stack based logging solutions so much more expensive than custom rolled solutions like logdna, datadog, papertrail, etc.?

The per-GB cost on the other ones starts at $1.20/GB and goes up to $2/GB, while almost all hosted ELK solutions start at $3/GB.

I'm asking because I would very much like to adopt an ELK-based hosted solution, but I'm not able to justify paying double. Is it that the running and resource costs for ELK are so high that the extra charge is necessary?

Yes it's mostly about resource costs. ES is a generic search that can handle logs but isn't focused on it, and indexing everything can be costly. There are major efficiencies gained in creating your own log-focused storage and access methods with object storage, columnar formats, and zone mapping, etc.

Datadog, etc are not cheap when you're paying per agent/per month to ship your logs in the first place. Which they want you to do, of course.

The pricing is not per agent per month. The pricing for everyone is always either per million events or per gb.

You do not need an agent. You can ship directly from syslog ( https://docs.datadoghq.com/integrations/rsyslog/?tab=datadog...)

> AWS's hosted solution is poor

What has been poor about it in your experience?

Not the GP, but you aren't allowed to touch settings like the shard recovery rate.

If a configuration change (changing # of nodes, instance type, etc) goes wrong, your cluster indefinitely gets stuck in Processing due to a race condition. The only way to get unstuck is to file a ticket. The company I'm at doesn't pay for AWS support, so at one point we ended up completely tearing down our cluster and rebuilding a new one (via Terraform) after getting tired of waiting for the reps. (They advised us to cut off log flow to let the system get out of processing, which we did, but it didn't work because once it gets stuck in processing like that it's just completely stuck).

It's difficult to troubleshoot issues - you can get some logs via Cloudwatch, but they're hard to search through and I'm not entirely positive everything shows up there.

Amazon is always several releases behind Elasticsearch versions.


Elastic.co's offering looks much better, just by reading their excellent comparison article: https://www.elastic.co/blog/hosted-elasticsearch-services-ro...

(We haven't used Elastic.co but what they say makes sense and I imagine their service is much better)


Once you hit a big enough scale - for us, we're pushing about 2TB a day of logs (that's before accounting for replication of course), it doesn't make sense to stay on Amazon's hosted service.

I'm in the process of advocating for in-housing our Elasticsearch setup and just building on top of ec2. Elasticsearch seems like the perfect candidate for Kubernetes since rebalancing is automatic and the affinity rules are simple (every Elasticsearch instance needs its own node). Cluster autoscaling (i.e. node-level, not _horizontal_) just makes too much sense.

Unfortunately I haven't gotten the go-ahead to take it inhouse, but I've been gunning for the project for some time now, so I'm hopeful I'll get the opportunity.


BTW, totally unrelated but for anyone managing Elasticsearch, make sure you have your shard count tuned properly. When I came to this company, they had their data way oversharded with primary shards varying between hundreds of kb to a few gb; i.e. orders of magnitude difference. Switching to ~50 GB shards (done via simplifying the way we were indexing) massively improved performance.
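
The shard-sizing advice above comes down to simple arithmetic. Here is a minimal sketch, assuming the roughly 50 GB per primary shard target mentioned above; the function name and the example index sizes are hypothetical and not part of any Elasticsearch API:

```python
import math

def primary_shard_count(index_size_gb: float, target_shard_gb: float = 50.0) -> int:
    """Return how many primary shards keep each shard near the target size."""
    if index_size_gb <= 0:
        return 1  # even an empty index needs one primary shard
    return max(1, math.ceil(index_size_gb / target_shard_gb))

# Hypothetical index sizes, in GB.
for size_gb in (0.5, 120, 2000):
    print(size_gb, "GB ->", primary_shard_count(size_gb), "primary shard(s)")
```

The right target still depends on hardware and query patterns, so treat the 50 GB default as a starting point rather than a rule.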

Also i3 instances > anything with EBS.


Check out the recently open sourced kubernetes scripts: https://www.elastic.co/elasticsearch-kubernetes. There's no need to reinvent all of that.

ES is indeed super solid if you know what you are doing. However, most users getting started with this are probably going to find out a few things the hard way though; which is why I recommend hosted solutions as it removes quite a bit of non trivial devops from the equation.

Not OP, but here is the list I had from a year ago: it didn't support all multi-AZ setups (only 2), the lack of admin features meant that when it broke you couldn't work out what happened, there was a limited choice of instance types, the backup happened once a day regardless of whether you needed it, and it had issues under load.

The backup issue has been resolved I believe but the others...

On AWS I’d suggest elastic on ECS and you can use the leftover compute on the cluster to run other applications effectively for free.

Also, some clients may prohibit uploading of data to third parties. In those cases you simply have no other choice than to run your own cluster.

I think it depends massively on how much data you have. A great many companies and websites only need a few hundred megs of text data indexed, which is easier to outsource.

Once you grow larger than that though, the hosted service prices get astronomical compared to standing up a cluster, assuming you have someone who can admin it.

Things are not quite cut and dried operationally. Some places may have only a few hundred MB of data but need really high availability and performance (they probably picked the wrong solution IMO, but still...), which is better guaranteed by hosting it yourself, perhaps even outside the cloud. For most places, the availability a hosted solution provides is better value than spending engineering hours to maintain these things. Devops / SRE folks are expensive compared to most other engineers, who primarily deliver features.

My point is that a ton of companies (more for cultural reasons than business requirements IME) are so freaked out by any downtime in any way that they’re going to pay for an engineer to just maintain these things themselves.

The only issue I have with this suggestion is that all of the situational awareness needed to _use_ Elasticsearch effectively is the same as is needed to _operate_ Elasticsearch effectively.

If you're just spinning up something's defaults and throwing data in it, that bill is going to come due eventually, and it's probably going to be ugly.

We host a smallish (12 large node) cluster ourselves on AWS, and it's been nothing but SUPER SOLID for us. Literally 0 issues in the year. We use it for analytics, aggregations and the like, as well as domain-specific search, for which it's a great fit.

I agree, if ES is scaled well for your usage it just keeps going. However, a big time sink for me has been reindexing, lost data due to bad mapping, and version incompatibilities.

I've found https://www.npmjs.com/package/elasticdump to be VERY useful for doing things like copying indices and recovering data. Much more than the built in stuff.

AFAIK, Reddit, Slack, Dice, Bloomberg, IBM, Apple all use Solr. Jira and Confluence use Lucene. Others use Elasticsearch and Fusion (commercial product on top of Solr). See, for example: https://www.activate-conf.com/more-events for presentations from several past years on who uses Lucene/Solr and how.

Also, the new trend in jobs is "Relevancy Engineering", which is less about just setting up search engines and more about actually tuning them. That's also where Machine Learning and other AI techniques come in (Learning to Rank, Named Entity Recognition, sentiment analysis, etc.). This was recognized by the rebranding of the conference from Lucene/Solr Revolution to Activate last year.

See also Haystack conference which focuses very specifically on relevance regardless of the specific search engine: https://haystackconf.com/

Until about 2014-2015, many large companies wouldn't have looked twice at Elasticsearch. The companies you list using Solr have been invested in search for 10 years or longer (predating Elasticsearch), and may have high switching costs.

For sure they would have high switching costs. They would be switching from an open-source (and free product) to which they contributed changes to a commercial product. License alone would be a serious discussion point. So, that's a fact.

Was there an opinion in there as well that you tried to convey?

Bloomberg, Apple, IBM also use Elasticsearch or more of the Elastic stack!

From what I read, wikipedia uses ES.

From my experience, yes. Lucene is a production-grade, state-of-the-art library, and Solr/Elasticsearch are widely used in many scenarios.

This expertise is very much in demand.

My company migrated from ElasticSearch to https://vespa.ai/ and could not be happier. It is faster and way easier to maintain a cluster. The "Application Packages" feature in Vespa opened many opportunities to improve our product. (Curiously, we use Lucene inside our custom application for a "Search map" functionality, something like https://www.lexisnexis.com/en-us/products/lexis-advance/sear... ) I highly recommend it!

I've looked at Vespa a bit. It looks pretty good.

It's also readily apparent that it's an ancient system that's grown out of an in-house project, and that its design has accrued a lot of oddities over the years from lack of careful, co-ordinated design. It includes a bunch of esoteric features (like the "predicate" function) that have obviously grown out of Yahoo's own internal architecture.

One curiosity is Vespa's approach to schema and configuration changes. To make any kind of change, or indeed set up an index, you have to create that "application package" containing your schema and configuration in the form of files, and then use separate REST APIs to "upload", "prepare" and "activate" it. There's a CLI tool to help perform those steps, at least.

It's nice that they're more consistent and rigid about schema and config evolution than Elasticsearch. But it's not exactly operator-friendly, at least not for first-time users with no pre-existing operations based around Vespa.

The package design also makes it more cumbersome to perform programmatic updates for a schema. I once worked on a SaaS project where we indexed data in Elasticsearch — arbitrary documents where we didn't know the schema ahead of time, because we just accepted any JSON document posted by the client. With ES, we could just use its dynamic mapping support, which automatically creates field definitions when new fields arrive (using regex-based templates). Do you know how long a package update takes in Vespa, to add, say, a single field?

The Vespa documentation is also pretty terrible, in my opinion. They explain a lot of things, but it's confusingly written, uses a lot of homegrown terminology, and neglects to collect all the reference documentation in one place. For example, you can't find an overview of the entire API — fragments of it are just scattered across a dozen or so unrelated pages.

Lastly, Vespa is Java. One of the biggest challenges maintaining Elasticsearch is controlling its resource consumption. You have to give it a lot of RAM, and it's never clear how much it needs and what configuration settings and usage patterns affect its memory use. Tuning it is something of a dark art. I don't know exactly how Vespa is implemented (is it all pure Java?), but I'm worried that, being a JVM app, it has the same shortcomings.

Predicate fields are indeed an oddity, but not an architectural one - it's for situations where the documents need to specify criteria (predicates) for when they should match - like only match for certain users, certain times of day etc. It's probably an underused feature imho since most people don't know this can be done efficiently.
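
To make the idea concrete, here is a conceptual sketch in plain Python of documents that carry their own match criteria. It is not Vespa's syntax or implementation: the documents, field names, and the `search` helper are all hypothetical, and it uses a naive linear scan, whereas the point above is that an engine can evaluate such predicates efficiently.

```python
# Hypothetical documents, each carrying its own match predicate.
DOCS = [
    {"id": "ad-1", "text": "lunch deals",
     "predicate": lambda q: 11 <= q["hour"] < 14},          # only around midday
    {"id": "ad-2", "text": "lunch deals for members",
     "predicate": lambda q: q["user"] in {"alice", "bob"}},  # only for certain users
    {"id": "ad-3", "text": "lunch recipes",
     "predicate": lambda q: True},                           # always eligible
]

def search(term, context):
    """Return ids of documents that contain the term and whose predicate accepts the query context."""
    return [d["id"] for d in DOCS
            if term in d["text"].split() and d["predicate"](context)]

print(search("lunch", {"user": "carol", "hour": 12}))  # ['ad-1', 'ad-3']
print(search("lunch", {"user": "alice", "hour": 20}))  # ['ad-2', 'ad-3']
```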

If you have dynamic fields like in your SaaS example I recommend using a single map field rather than let data not under your control drive changes to the set of fields.

> Do you know how long a package update takes in Vespa, to add, say, a single field?

A few seconds. However, rather than having operators do any of this manually, set up an automatic process which deploys on each change made to the repo (i.e do CD).

> all the reference documentation in one place


Another thing is that Vespa doesn't seem to support indexing of nested data, either structs or arrays of structs. For example:

    "location": {
      "city": "Washington",
      "state": "District of Columbia"
    },
    "friends": [
      {"firstName": "Bill", "lastName": "Clinton"}
    ]

Maps aren't suitable here because they can't be used for ranking. So you have to use structs, but those aren't indexable.

An application's search module could flatten the location key (e.g. "location_city", "location_state") for simple attributes, but the same is not possible for the array, since there can be arbitrary array elements. And you can't split it into arrays of strings:

    "friends_firstName_elems": ["Bill"],
    "friends_lastName_elems": ["Clinton"]

...because queries like "firstName contains 'Bill' and lastName contains 'Clinton'" could match different records ("Bill Bryson" and "George Clinton"). Never mind deeply nested arrays of objects containing arrays containing objects containing arrays.
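
The false match described above is easy to reproduce outside any search engine. The following is a plain-Python sketch, not Vespa or Elasticsearch behavior; the records and helper names are made up for illustration, reusing the names from the example:

```python
# Two hypothetical records; only the second one has a friend named Bill Clinton.
records = [
    {"friends": [{"firstName": "Bill", "lastName": "Bryson"},
                 {"firstName": "George", "lastName": "Clinton"}]},
    {"friends": [{"firstName": "Bill", "lastName": "Clinton"}]},
]

def flatten(record):
    """Split the array of structs into parallel string arrays, losing the pairing."""
    return {
        "friends_firstName_elems": [f["firstName"] for f in record["friends"]],
        "friends_lastName_elems": [f["lastName"] for f in record["friends"]],
    }

def matches_flattened(record, first, last):
    """Each condition is checked against its own array, independently."""
    flat = flatten(record)
    return (first in flat["friends_firstName_elems"]
            and last in flat["friends_lastName_elems"])

def matches_nested(record, first, last):
    """Both conditions must hold for the same array element."""
    return any(f["firstName"] == first and f["lastName"] == last
               for f in record["friends"])

flat_hits = [i for i, r in enumerate(records) if matches_flattened(r, "Bill", "Clinton")]
nested_hits = [i for i, r in enumerate(records) if matches_nested(r, "Bill", "Clinton")]
print(flat_hits)    # [0, 1]  (record 0 is a false positive)
print(nested_hits)  # [1]
```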

This seems unnecessarily restrictive. A search engine should be able to index the data you already have, not force the application to contort its data to whatever shape the engine requires.

Is there no way around this?

Thanks. I'm still learning about Vespa, and it's still not clear how map fields work.

Edit: Documentation says: "Accessing attributes in maps and arrays of struct in ranking is not possible". So maps aren't really usable.

Regarding how long it takes to update a field, the application I described would have to do this programmatically. It would have to keep track of all known fields in some kind of registry, and then if a new unknown field came in, it would have to perform an "application package" deploy just for that field, using the REST API. (Unless there's a less cumbersome way to do it?)

Reference docs: That's nice, but that's just a bunch of links. Good reference documentation has tables of contents. Bonus points for runnable examples in multiple languages. For an example of good reference API documentation, look at Stripe's [1].

[1] https://stripe.com/docs/api

Just an FYI to your last paragraph: The core indexing/ranking/storage components of Vespa are C++, and run in a separate process (no jni).

In my own attempt to compare the two, I found the memory consumption of Vespa was easier to predict and understand (there are formulas for it in the documentation).

Thanks, I didn't know that!

I am very interested in hearing about your experiences migrating to vespa. When it was opensourced we all thought it would be an absolute game-changer, but I see so few people building new products (or migrating existing products to) vespa.

You can contact me. I'm also planning to do a blog post comparing it with Solr and Elasticsearch. I think it naturally takes some time to adopt a solution like that, and the ecosystem is still in its infancy. But, by chance, a project using Vespa appeared in my GitHub timeline today (https://github.com/rdoume/News_API). So adoption is increasing.

For me Vespa is an absolute game changer in features and, as someone said here, ES looks like it's intentionally complicated to maintain, with nodes randomly becoming unhealthy. Vespa is like Redis to me: I completely forget about maintaining it and it works great.

It makes a world of difference in our product, and I take every opportunity to evangelize it.

That's interesting to hear. Would love to read about how Vespa compares to Solr and ES.

This may be of interest to you: https://sematext.com/opensee/report/project/trend?q=ElasticS...

Would you happen to know how Vespa compares to ES in terms of memory or CPU footprint? Have you done apples to apples comparison by any chance?

I do not have a completely fair comparison. But in our migration from Elasticsearch 5 (2016) to Vespa 7 (2019), we halved our node count and cut the average response time in half. Another amazing feature we noticed during the migration is that Vespa allows you to reduce or increase the number of nodes dynamically, and it takes full care of the data distribution. In ES we used to have to stay within the limits of the number of shards/replicas pre-configured at index creation.

I have been wondering why Vespa isn't getting much traction. Everyone still defaults to ES, even in new projects.

It should be marketed better, I feel. Some SEO for "Solr/lucene vs X" queries might help. I have been spending the last 3-4 months studying open source and commercial search systems, but it's only in this thread that I discovered Vespa.

This sounds very interesting! What is the size & scale of your data, if you don't mind sharing? (How many documents, total storage footprint, etc) Thanks!

I work professionally in this space and I can say over the past ~10 years, my full time job at 4 companies has been almost entirely to migrate away from SOLR / Lucene solutions and implement custom in-house search indexes.

Most of the time it has been for performance reasons. SOLR / Lucene have very poor performance characteristics, especially when needing to support custom sort ordering and heavy use of filters.

On other occasions it has been because you can’t easily extend search indexes to more advanced use cases, like similarity-based reverse item search, collaborative filtering, more advanced treatment of cold start issues / bias towards existing popular content / trending search.

A lot of small or medium sized companies naively figure they’ll just use SOLR etc to get something out of the box, or for side channel problems that are smaller scale.

But you come to regret it pretty fast because you end up needing one standardized way to build and deploy search indices and it has to support all the bells and whistles that SOLR can’t _and_ be faster than SOLR, for the big product use cases.

A lot of product companies now are beginning to use word-vector approaches with nearest neighbor libraries like ANNOY, as the first solution instead of the solution you eventually have to migrate to when you realize SOLR does not actually support your use case, not even as a means to get it up & running quickly.

Are you seeing a blueprint start to emerge for a standardized way to build and deploy search indices in the context of applications that need vector-space features? (E.g. if you start with ANNOY, you get kNN but then how do you add in the ability to refine, filter, rescore, sort with text, etc?)

I actually have seen the opposite trend. You don’t want to standardize the search engine and that has been a major problem with SOLR.

Instead, you want to custom build the data retrieval system so it’s tailored to your use case.

One example from experience was needing to add hard filterable metadata to an in-house search index. We solved this by actually calculating bit masks that represented all the filtering criteria and having a frontend preprocessor that would first restrict to the filtered subset and then do TFIDF-based relevance sorting.

Creating the bit mask tooling ourselves (instead of relying on whatever baked-in method of scanning items and filtering that comes with out of the box search engine tools) allowed us complete control over the trade-offs, particularly managing document deletions and optimizing run time performance in certain ways that just weren’t available in out of the box tools, as well as being able to integrate any in-house code into the search engine as needed (since the whole system was in-house code).
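
The two-stage approach described above (restrict by bit mask first, then rank by TF-IDF) can be sketched in a few lines. This is a toy illustration, not the in-house system from the comment: the filter criteria, the documents, and the particular TF-IDF variant are all assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical filter criteria, one bit each.
IN_STOCK, ON_SALE, FREE_SHIPPING = 1, 2, 4

# Hypothetical documents: (text, bitmask of the criteria the document satisfies).
DOCS = [
    ("red running shoes", IN_STOCK | ON_SALE),
    ("blue running jacket", IN_STOCK),
    ("red trail shoes for running", ON_SALE | FREE_SHIPPING),
    ("red shoes", IN_STOCK | ON_SALE | FREE_SHIPPING),
]

def tfidf_score(query_terms, doc_terms, doc_freq, n_docs):
    """Sum of term frequency times smoothed inverse document frequency."""
    counts = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        if counts[term]:
            tf = counts[term] / len(doc_terms)
            idf = math.log(n_docs / doc_freq[term]) + 1.0
            score += tf * idf
    return score

def search(query, required_mask):
    tokenized = [text.split() for text, _ in DOCS]
    doc_freq = Counter(term for terms in tokenized for term in set(terms))
    # Stage 1: keep only documents whose mask contains every required bit.
    candidates = [i for i, (_, mask) in enumerate(DOCS)
                  if (mask & required_mask) == required_mask]
    # Stage 2: rank the surviving candidates by TF-IDF relevance.
    terms = query.split()
    scored = [(tfidf_score(terms, tokenized[i], doc_freq, len(DOCS)), DOCS[i][0])
              for i in candidates]
    return [text for score, text in sorted(scored, key=lambda s: -s[0]) if score > 0]

print(search("red shoes", IN_STOCK | ON_SALE))  # ['red shoes', 'red running shoes']
```

Because the filter is a single integer comparison per document, it runs before any scoring work, which is the kind of trade-off the comment describes wanting direct control over.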

You want to create data models that are highly application specific, and then route data into them. The mistaken approach of one-size-fits-all tools, especially in information retrieval, is to pre-define the supported behavior of the application, like a web service wrapping a search index, with baked-in assumptions about the trade-offs and only limited support to modify or configure the trade-offs under the hood.

The gravest mistake is thinking just because your use case seems to function OK with those assumptions now, that you can marry yourself to the underlying data model. Then in the future you’ll hit the point where you have to throw it away and create something custom, but it will be far more costly to do so and extremely hard to migrate gracefully and ensure integrations are working.

Afaik almost everything runs Lucene under the hood. It's 20 years old, and no one is going to build something as good any time soon. I suppose some companies like Google have their own in-house solutions, but otherwise it'll always be something built on top of Lucene.

I guess you don't see much demand because for a lot of use cases the basic setups are good enough.

About seven years ago, I got a contracting gig for a website that wanted a "search engine". I remember thinking "Solr/Lucene is old, not pure-functional, and therefore awful!" and decided to build my own. Somehow I even managed to convince the client that this was a good idea.

I ended up trying to reinvent Solr for the client. After about two days of trying to reinvent stemming and indexing, I realized this was stupid to do on someone else's time, called the client to tell them I was moving to Solr, and got the project done ahead of schedule as a result.


I think for 99% of usecases (involving search), Lucene/Solr/ES is perfectly fine. However, I do absolutely hate that some companies have decided to make it their primary database.

EDIT: I just want to make it clear, I think it's totally valid to try and reinvent Solr for fun, or if that's something you're paid specifically to do; nothing is perfect, and I am actually a big fan of the "if it works, break it and make it better!" mentality.

I can chime in here that Lucene-based solutions are almost always sufficient. For a purely frontend, JS-based fuzzy search engine, check out fusejs: https://fusejs.io/

Can you explain a bit why ES isn't a good solution for storing data itself?

I inherited a legacy Mongo solution, and all the data is duplicated and indexed in ES, so I've always wondered why we're using both. Mongo has none of the SQL capabilities that would make my life easier, and the types of queries allowed by Mongo could be done with ES.

What are the negatives of ES alone?

It's not reliable: https://www.quora.com/Why-shouldnt-I-use-ElasticSearch-as-my...

The v7 upgrade to a new cluster protocol (zen2) has improved things but overall the system has a long history of losing or destroying data. It's better to have a primary OLTP system that's ACID and reliable while using ES as the secondary search source. You can also remove the _source field if you just need matches without the original content.

It's common to see this pattern used with a relational database since, as you can see, Mongo doesn't buy you much else as another document store.

Thank you very much. It seemed like we were just duplicating a bunch of schema-less JSON for no reason, but if ES can lose data, that's probably not a good idea.

Follow up: MongoDB is adding full-text search capabilities: https://www.youtube.com/watch?v=4QUGWnz-XaA

> "if it works, break it and make it better!"

Fix it 'til it breaks is what I always say :D

Needless to say my 3D printer had a lot of down-time haha.

"If it ain't broke, open it up and find out what makes it so bloody special" - I think that's wisdom from the BOFH.

Agreed re: using ES as primary storage (which it is NOT meant as) - as far as I can tell, it might even make you in breach of the GDPR [0].

TLDR: Lucene < 7.5 won't merge segments larger than 5GB (default) unless they accumulate 50% deletions.

Delivering a conference talk [1] later this year about it.

[0]: https://www.eivindarvesen.com/blog/2018/09/16/elasticsearch-...

[1]: https://2019.javazone.no/program/3f7cd8a7-a9ea-4874-a7dd-531...

> Agreed re: using ES as primary storage (which it is NOT meant as) - as far as I can tell, it might even make you in breach of the GDPR [0].

How is GDPR compliance of having data in Elastic Search influenced by it being primary vs. secondary storage?

See my reply to bryanrasmussen for a full explanation.

Basically: you reindex ES periodically, so when a user is deleted from the primary, it will disappear from ES upon the next reindex. The old index is deleted at the file system level.

At some point, though, the pedantry can get out of hand. After all, 'deleting' at the file system level is just 'unlinking' the inode from the underlying data blocks... in fact, data forensics at the file system level is probably more well-understood than recovering deleted data from a Lucene shard.

at what point would you be able to 'delete' data without being in violation of GDPR?

I know - it really is a matter of definition.

Though the EU has said it will consider intention etc. there's really no way of knowing for certain until when and if it's settled in a court case.

I think the instructor's response in the linked article is a reasonable defense, you don't really know that data is deleted all the way down to the file level. It is just marked as deleted and could be retrieved by someone clever enough to do so. At some point in the future it will be really deleted.

I don't think the GDPR regulatory agencies are operating at a technical level that they would make an argument that it was not a good enough deletion.

Finally I have to ask this part: assuming ES is not your primary database, how does this get around the GDPR issues? If someone wants their data erased you are supposed to erase it from wherever you store data, I suppose this means ES when it indexes a primary store and finds it has deleted data actually deletes it but if it is told to delete something it keeps it around?

When using ES for indexing and not the primary store, you can (and should) periodically fully reindex the data set. You can use a blue / green pattern — create a new index then swap from the old one to the new one. ES supports aliases, making this swapping transparent to the apps using the index. Now you have more options.

If it is easy to delete specific users from the primary database, the deleted users will naturally disappear during the next ES reindex.

Edit: The old index is deleted at the file system level.

If the reindexing occurs daily or weekly, perhaps this will satisfy GDPR.
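The blue / green swap described above can be sketched as a request body for Elasticsearch's `_aliases` endpoint, which applies its actions atomically. The alias and index names here are hypothetical, and the sketch only builds the JSON; sending it (for example with an HTTP client to `POST /_aliases`) is left out.

```python
import json

def alias_swap_actions(alias, old_index, new_index):
    """Build the body for POST /_aliases that repoints `alias` in one step."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

# Hypothetical names: apps query "users", which moves from users_v1 to users_v2.
body = alias_swap_actions("users", "users_v1", "users_v2")
print(json.dumps(body, indent=2))
```

Once the alias points at the new index, the old index can be deleted, which is the step that removes the stale user data from disk.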

There are other good reasons to not use ES as the primary data store. First, it isn’t entirely reliable. It’s good and I’ve never seen a corruption, but ES and Lucene’s history isn’t as a reliable database. Second, if you want to change how you index, it is a bit easier to do if the source data is outside of ES.

thanks, I wasn't arguing that using ES as primary was good. Just don't necessarily see the GDPR argument as being a reasonable one. Although I've seen some startups using Mongo as primary and have to wonder if there would be that big a difference in using ES at that point (not a Mongo dig as I've kept away from it for various reasons)

There's folks working on Bleve (written in Go) and developers that I work with want to use it (we use Elasticsearch heavily), but as I've told everyone like you just did, Lucene has a 20 year head start.

Thing is, there's heavy demand for something more performant than Elasticsearch, so eventually the market will provide.

Meanwhile, Redis Enterprise is trying to grab some 'share with RediSearch, which has some severe caveats IMO that make it not a great fit for most.

Tantivy is an interesting project I'd point to in this space:


That said - it's effectively Lucene rewritten in rust, so the main win is some performance gains. Lucene has spent a ton of time getting the details right, and it's unlikely we'll see an order of magnitude of innovation in that particular space. At the higher level for querying / query understanding it feels like there's still more technological room to grow vs the lower level details.

Tantivy main dev here. Thanks for the free marketing :)

It is not exactly a port, but yeah, tantivy is strongly inspired by Lucene.

> Lucene has spent a ton of time getting the details right, and it's unlikely we'll see an order of magnitude of innovation in that particular space.

Have you checked out the perf gain in Lucene 8.0 ? Block-WAND proved you wrong.

I suppose I could have phrased that better. I appreciate the correction. I mostly mean to say that the functionality is really impressive today, and serves its use case very well for the intended target of lower level search primitives.

Tantivy is a cool project, but I have to say the part I love most about it is your blog posts on it. They're a great introduction for people who are unfamiliar with the underlying tech of search engines.

> Tantivy is a cool project, but I have to say the part I love most about it is your blog posts on it. They're a great introduction for people who are unfamiliar with the underlying tech of search engines.

Thanks a lot! I am not a native speaker, and I often feel very bad at conveying engineering concepts. The positive feedback is actually very helpful :)

Seconded Tantivy. Very easy to set up and maintain, and faast.

If you’re looking for something more performant and you’re in retail I’d recommend taking a look at Apptus eSales. Their product has displaced ES/SOLR at several retail websites in Sweden. https://www.apptus.com/

Disclaimer: I’m a previous employee but have no economic interests in this as it’s not a publicly traded company.

We use Solr/Lucene heavily, ingesting about 3TB a day. We had to build our own clustering since we started before the Solr cloud project. We have been very happy with the results.

>Afaik almost everything runs Lucene under the hood, it's 20 years old, no one is going to build something as good any time soon.

Vespa [1] would like to have a word with you.


Move on! IMO it's a donated tool: Yahoo can't maintain it, so it gave it to the community to reduce the cost. If only it was a new project which wanted to address search in a different way.

If anything Vespa has actually gotten faster in development since it Open Sourced.

I’ll just chime in to say that Algolia runs on a home-made C++ engine. [I work at Algolia]

Wonder what DuckDuckGo might have built for its search..

I see their logo on the main Solr page [0]. They must be using it in some capacity still.

[0]: https://lucene.apache.org/solr/

Word is they use the Bing API for web search.

The technologies are still heavily used but there might be slightly less demand for the skill set because:

1) Cloud search service - You are less likely to deal with setting up your own instance and concerns that go with it (sharding, etc) because most Cloud providers offer either ElasticSearch as a service or some form of turnkey deployment. You still have to do some low-level work like score manipulation but you don't deal with as much administration.

2) Search as a Service - https://en.wikipedia.org/wiki/Search_as_a_service. There are several companies that provide a Search SaaS offering. Typically they provide value adds above just running your ES service. Often they will provide web crawlers so you can just point them to your domain or they might provide other datasource integrations like pulling content from a database. You get access to Solr/ES functionality if you want it but you can get search running without going to that level if desired.

Either way a Lucene based stack is still in use.

About a year ago I set up an ES cluster to load Apache logs into as an experiment. Around a month later my boss asked if it was down. "Yeah, it looks like it, since you are using it let me upgrade it from an experiment to critical!" Since then we've been using it in more and more places, and are looking to see if it would be a good fit for storing some of our user data in. The big barrier right now is that we are running the version of SQL Server right before change data capture became part of the standard edition, and that's the way we'd probably prefer to synchronize data into ES from our primary database. We have a couple home-grown solutions for synchronizing secondary data sources, and we'd like to get out of that business. The SQL server is always going to be our source of truth, probably.

>The SQL server is always going to be our source of truth, probably.

ES is not supposed to be your source of truth.

Definitely agree - ES is designed more like a search appliance - it definitely should be pushed data from other databases that are the source of truth.


In the early days at least, cosmosdb was built on es. Dunno about now though.

Can you elaborate on the reasons for this?

Coming from experience with Solr, the answer is far simpler than the link in the sibling comment indicates: Data is processed and manipulated on import, and it's difficult - sometimes impossible, depending on the field config - to get that data back out in the format it was imported.

Any changes to the schema, such as switching fields between indexed/not indexed/stored/not stored, requires reimporting the data to populate those fields, data which you're not likely to have if it was your primary store.

Elasticsearch has a _source field [1] that stores the entire original document and is enabled by default. It's required to support features like highlighting in results. ES also has a reindex API that specifically makes use of this [2].

1. https://www.elastic.co/guide/en/elasticsearch/reference/curr...

2. https://www.elastic.co/guide/en/elasticsearch/reference/curr...
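As a rough illustration of the reindex API mentioned above, the request body for `POST /_reindex` names a source and a destination index; Elasticsearch then copies documents by reading their stored `_source`. The index names below are hypothetical, and the sketch only builds the body rather than sending it.

```python
import json

def reindex_body(source_index, dest_index):
    """Build the body for POST /_reindex (copies docs using their stored _source)."""
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
    }

# Hypothetical names: copy into an index created with the new mapping.
request = reindex_body("articles_v1", "articles_v2")
print(json.dumps(request))
```

This only works while `_source` is enabled on the source index, which is why disabling it trades away the ability to reindex from within ES.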

Here is a very good Quora answer on why you should never use ES as a central repository for data: https://www.quora.com/Why-shouldnt-I-use-ElasticSearch-as-my...

Search gets less attention these days, but mostly because the current tools work so well. Scaling Elastic is still a bit of a dark art, but our company likely wouldn't exist without lucene/elastic. They take a bit to learn and use correctly, but they are incredibly powerful.

What do you mean by saying scaling Elastic is a dark art? Care to expand on that?

I'm by no means heavily experienced in this, but my current employer has a big demand for this. (e-commerce)

First there's the Docker + Kubernetes architecture that ES lends itself to really well. Then (depending on your use-case) there are concerns like hot/warm architecture, node types, ETL/indexing processes. ES recently moved over to openJDK, so there's a couple intricacies there (i.e. JVM heap size)

Then, there's document/query structure. In no particular order:

- Do you have any parent/child relationships?

- Do you have stop-word lists developed?

- Can search templates help your queries?

- How will you interface with ES? It has REST APIs, but it's recommended to not expose ES directly to your applications.

- Some advanced querying possibilities like customizing tokenizers, normalizers, and a bit of internationalization.

- Oh, we haven't even discussed security yet.

- Also, ES isn't meant to be a primary data storage. This is more so a "cache", but not quite like Redis. So, you'll need a DB elsewhere most of the time.

All of this changes depending on if you're using it for SIEM, e-commerce, AI/ML, etc. Also, Elastic now provides their own SIEM solution, a pre-built search solution (AppSearch + Search UI), built-in security features. Check out the new ES 7.2 update; it's kinda nuts.

> ES recently moved over to openJDK, so there's a couple intricacies there (i.e. JVM heap size)

My current employer uses ES - we're on 6.8, planning to move to 7 in a few months. Judging by the other replies here I'd say we have a reasonably large cluster (150+ i3.2xlarge instances, billions of documents), so tuning the cluster is very relevant to us. Could you expand on how things have changed with the move to OpenJDK?

I've seen some claims online that, contrary to what Elastic recommends in their docs, a few machines with huge heaps (100+ gb) is the way to go, rather than many machines with 20gb heaps.

>I've seen some claims online that, contrary to what Elastic recommends in their docs, a few machines with huge heaps (100+ gb) is the way to go, rather than many machines with 20gb heaps.

Usually the recommendation is less than 32GB - this link has some more discussion about it: https://discuss.elastic.co/t/es-lucene-32gb-heap-myth-or-fac...

It seems whether it's better or worse depends on your data set. But I would love to see tests of different kinds of workloads with large or smaller heaps.
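For reference, the usual rule of thumb (give the JVM about half the machine's RAM, but stay under the roughly 32 GB point where compressed object pointers stop working) can be sketched like this. The 31 GB cap is an assumption; the exact threshold varies by JVM and platform, so treat this as a starting point and not a recommendation.

```python
def suggested_heap_gb(ram_gb, compressed_oops_cap_gb=31):
    """Half of RAM, capped below the ~32 GB compressed-oops threshold (assumed cap)."""
    return min(ram_gb // 2, compressed_oops_cap_gb)

# The rest of the RAM is left to the OS page cache, which Lucene relies on.
for ram in (16, 61, 256):
    print(ram, "GB RAM ->", suggested_heap_gb(ram), "GB heap")
```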

It's multivariate calculus.

Also, you have to plan ahead and over-allocate or deal with fixed indexes/datasets. You also have to religiously monitor the garbage collection and deduce what's going on with search & indexing performance. When the situation changes you need to scale your cluster and re-index everything, which is not a trivial thing at most companies. I've seen bad situations at companies where it takes days to re-index a cluster and they're dead in the water until it's done.

And that's just the operations side. You have to make sure your data is flat (because nesting creates subindexes for Lucene and kills your search performance), that you only define in your index template the fields that you want to be searchable (and binary blob the rest), etc.
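One way to keep documents flat is to collapse nested objects into dotted keys before indexing. This is a minimal sketch assuming the nesting is made of plain dicts only; lists of objects need a separate decision (join them, or accept a nested mapping), and the sample document is hypothetical.

```python
def flatten(doc, prefix=""):
    """Collapse nested dicts into a single level with dotted keys."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

# Hypothetical document with two levels of nesting.
order = {"id": 7, "customer": {"name": "Ada", "address": {"city": "Oslo"}}}
print(flatten(order))
```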

Within my current organisation, two main search options are Solr and Elasticsearch, both based on Lucene.

Generalising a bit, Solr is more targeted at enterprise search and unstructured content search (e.g. bundled with most content management systems), and Elasticsearch is more targeted at data analytics and structured data search (e.g. with the ELK stack). Again simplifying a bit, Solr can be a bit more configurable and/or work better with the types of data that benefit from more configuration, and Elasticsearch can work better with the types of data that work more "out of the box".

I'd agree that there don't seem to be a huge number of openings for specialist search roles or a huge number of people specialising in search, but it is often part of another role and there are often people who have touched on search in their roles. That suggests that many people are just using it with largely default setups. Having said that, things like advanced relevancy tuning, if you need it, is a very much under appreciated skillset, and definitely needs someone with good experience or ability to learn.

I'm using Solr for https://www.findlectures.com, but I think Vespa looks interesting - lets you store feature vectors in the index, so you can do neat things to incorporate ML algorithms in ranking.

That's right! Vespa looks cool. I really appreciate that they even implemented a use case as a proof of concept.

Feature vectors do tend to get incorporated in relevance tuning (regardless of the engine), but from what I've heard of Vespa, features (and ML in general) are first-class citizens, whereas with Elasticsearch and Solr, text statistics are your first-class citizens, and you're adding in additional features and integrating ML at the periphery.

We do at Vimeo: I gave a talk this week at the NY Elasticsearch meetup about a new search product we built using ES. If you're interested you can access the recording of the livestream here: https://vimeo.com/348443979 The product in question is https://vimeo.com/stock

Yes but the search market has shifted.

We're at a point where Lucene and family are used for increasingly sophisticated use cases. The commodity end of the market used to be dominated by open source (Solr connected to Drupal for example)

Now for commodity sites there’s so many SaaS search products it doesn’t make as much sense to hook up Solr or ES to make your blog or university website or whatever searchable. A lot of basic search use cases are covered by products where you don’t want to have to hire a team to manage search.

But at the higher end apps with search, customization, especially of doing domain specific relevance at scale, is often a product differentiator (but often not so important or weird you should write your own engine). So this is where these systems thrive...

100% agreed. This is what my company uses ES for, and it's exceptional at it.

SOLR is great but it's a pain to manage it in the cloud. If you lose an EC2 instance, there is manual work involved when you bring up a new instance. You have to tell the new instance servers which shards they're going to replicate. If the EC2 instance hosting shard1 replica2 goes down, you can't just bring up a new instance and have it be replica2. You need to use the API (which is just a call to a bunch of URLs) to get the new instance to be part of shard1. Also, a good cloud overview UI would be nice. 8.1.1 does have some improvements.

Also, SOLR speed is almost directly proportional to disk speed. If your index is on solid state drives with high IOPS, you'll be fine.

Backing up a large index is a little painful too.
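The replica-replacement step described above goes through Solr's Collections API, where `ADDREPLICA` attaches a new replica to a given shard. The sketch below only builds the URL; the host, collection name, and node name are hypothetical.

```python
from urllib.parse import urlencode

def add_replica_url(base_url, collection, shard, node=None):
    """Build a Collections API URL that adds a replica to `shard`."""
    params = {"action": "ADDREPLICA", "collection": collection, "shard": shard}
    if node:
        # Optional: pin the replica to a specific node instead of letting Solr choose.
        params["node"] = node
    return f"{base_url}/admin/collections?{urlencode(params)}"

# Hypothetical cluster: put the replacement replica for shard1 on the new instance.
print(add_replica_url("http://localhost:8983/solr", "products", "shard1"))
print(add_replica_url("http://localhost:8983/solr", "products", "shard1", node="10.0.0.5:8983_solr"))
```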

We have been using https://www.algolia.com/ completely as a replacement for ES.

Pros:

- Managed search engine

- Great API / Developer experience

Cons:

- Cloud only makes it hard for local development

- Expensive (I guess it depends on the usage)

I have been working on an open source alternative: https://github.com/typesense/typesense

Would love to hear your feedback :)

Worth noting this is what hacker news itself uses

You mean https://hn.algolia.com/? That's not made by the HN team though; I wouldn't be surprised if the Algolia guys built it to show off their tech, which is a very smart idea anyway.

It is a similar cost to elastic/solr cloud options, cheaper if you have to get to feature parity with Algolia.

Oh that's a good point. I was referring to Algolia vs self hosted ES and that's why I pointed out that it depends on the usage (how much power you need and how many people to maintain it).

I use Solr & ElasticSearch heavily — they're “boring” in the sense that they do a lot of heavy lifting without many surprises and they scale easily into at least terabyte-sized indexes.

One area where this might be less true is that the full-text search in Postgres & MySQL has matured to the point where some basic applications might reasonably decide that it's not worth using a separate service.

Elasticsearch is very popular because it works well for generic searching and can be customized for lots of unique scenarios. There's competition on the infrastructure side using something other than Java/JVM though:

For Rust, there's Toshi: https://github.com/toshi-search/Toshi which is built on top of Tantivy: https://github.com/tantivy-search/tantivy

For C++, there's Xapiand: https://github.com/Kronuz/Xapiand

For Go, there's Blast: https://github.com/mosuka/blast built on Bleve: https://github.com/blevesearch/bleve

Yes. They might not look trendy anymore, but they are still heavily used in the industry. I constantly see use cases where Solr or ES would be a much better choice, but those options are simply ignored because they are rarely visible at the top of tech publications.

We use elasticsearch to power our ecommerce search, and it works pretty well, but we're considering moving to a commercial product, or solr, to get closer to personalized results based on our knowledge about the user.

We just rewrote our internal search API from a windows service indexer with lucene indices and a vb.net SOAP api in iis to a netcore service, hosted in k8s, that splits out ingest, analysis, storage and queries into separate domains, with the writes going to Azure Search Service*.

Our use case might be a bit weird -- this app is essentially an internal API that supports the search needs of our other teams and their own products for our own internal software. It probably has 30 million records across a few different indices. We made the decision to migrate from lucene because of the ease of clustering elasticsearch. We previously achieved availability by just running multiple copies of the standalone service and doing smart health checks at the load balancer level in case a lucene index got corrupted and needed to spend a day rebuilding, but that didn't scale well for rebuild times, and we have been consolidating all of our legacy tech onto netcore and kubernetes.

Raw lucene was an order of magnitude faster than azure search service, but that's probably more a function of being able to essentially query the indices directly in memory of the webservice, as opposed to a slightly underprovisioned search cluster with all the HTTP overhead. We're migrating it to our own elasticsearch cluster right now for performance, cost savings and cloud-agnosticism.

We have an early access product for personalized ecommerce search @Sajari if you are interested. One early access company is on track to generate $30 million in additional revenue from switching (over 10% search conversion increase). That is across millions of SKUs, with hundreds of products updated per second.

We are also looking at releasing this as a k8s deployable product. It's all k8s services and gRPC already...

For e-commerce search, personalization with Elasticsearch takes a similar level of effort as with Solr. Don't re-platform under the impression it will make personalization easier. It still takes data collection and experimentation but can be accomplished on Elasticsearch. Feel free to contact me if you have questions.

Two years ago I decided to go with Postgres' built-in fulltext search instead of adding another dependency like ElasticSearch, and I believe I've profited from that in much less maintenance while still getting quite good performance/features.

Do you use ts_rank? PostgreSQL FTS is very efficient until you want to rank the results according to their relevance. This is because the data necessary to the ranking are not in the GIN or GIST index. They are in the heap, and this triggers a lot of random IOs.

Ah, this is good to know. My site doesn't yet need to scale, so this is definitely A Problem I Would Love To Have ;)

EDIT: This seems to help with the ranking problem: https://github.com/postgrespro/rum

Yes, RUM is great. I'd hope it will be built-in one day.

Any tips for scaling Postgres-only fulltext search?

Good find, bookmarked!

I wish I needed to, let's just say that!

We use ES heavily. Most of our queries are basic document filtering plus some geospatial stuff. We could probably have done it with Postgres/PostGIS, but with AWS-managed ES, it's all "good enough" -- we can do geospatial searches on millions of documents with response times around 100ms. The other part I like about ES is how it's easy to scale out across machines, which lets us handle quite a bit of load and tolerate failures easily. We have a cluster of 5 m4.large instances and it only runs us about $600/mo. Like others have said, tuning AWS ES sucks, but it's always been good enough for us.

We've run into some pain points like trying to index very large shapes into a geospatial index, but have workarounds for basically everything now. We also had a problem where when AWS had the outage around autoscaling groups a few months ago, we lost 3/5 of our instances and had to reindex some data from backups. That was the worst thing that's happened.

I'm sure there would be better/faster/cheaper ways of doing what we do, but for what we get out of the box for the price, it's going to take a lot for us to move away from it for now.

Yep, most of the audience facing search functionality across our sites (https://www.bbc.co.uk) is powered by various Solr clusters, hosted on-prem and cloud.

ES powers search on one of my side projects, https://dealscombined.com.au.

The ability to not only full text search but to do it fast, to tune the lexical behaviour (lowercase, plurals, stemming etc.) and to top it all combine geo search pretty much left any other solution in the dust. I even considered some paid solutions.

I also considered Postgres, which looked strong, but I felt it’d be harder to set up these features and that the full text would be weaker. Geo might be stronger, but my geo needs are simple.

ES was easy to set up to do this, taking about 2 hours of tuning. I used AWS so I didn’t have to figure out how to install it. I admit I had a mental model of ES from ELK-ing at work.

At some point when the site gets more traffic I’ll tune the search so that rather than nearest matching, I’ll score both the distance and the words and order by perceived relevance. I.e., weigh up both how close something is and how well the words match.

ES is a pretty amazing tech and it’s the easiest way to set up a decent quality free text search for your site.
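The planned blend of distance and text match can be illustrated in plain Python. This is a sketch of the idea only, not the Elasticsearch API (which offers decay functions through its function_score query); the exponential decay, the 10 km scale, and the sample deals are all assumptions.

```python
import math

def combined_score(text_score, distance_km, scale_km=10.0):
    """Blend text relevance with proximity using an exponential distance decay."""
    proximity = math.exp(-distance_km / scale_km)  # 1.0 at 0 km, falls toward 0
    return text_score * proximity

# Hypothetical results: a strong text match far away vs. a decent match nearby.
deals = [
    {"name": "far but exact", "text_score": 2.0, "distance_km": 40.0},
    {"name": "near and decent", "text_score": 1.2, "distance_km": 2.0},
]
ranked = sorted(
    deals,
    key=lambda d: combined_score(d["text_score"], d["distance_km"]),
    reverse=True,
)
print([d["name"] for d in ranked])
```

A smaller `scale_km` makes proximity dominate; a larger one lets the text match win more often.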

yeah, there aren't many alternatives unfortunately. I've used Sphinx a lot, but am now stuck with ES and it is horrible to operate, probably because we don't need a cluster solution so it is total overkill. Yeeeah for technical debt.

For small projects (for some 1000s of documents), I'd probably go with Postgresql FTS if possible. Sphinx/Solr for anything with indices smaller than a couple 100GBs. After that, ES seems reasonable & worth the overhead.

EDIT: my biggest issue with ES is that it seems to be specifically engineered to sell you support. So get a managed version if you can.

This was exactly my pain point as well. For smaller projects, ES is overkill. So I decided to do something about it! I started working on an open source, really developer-friendly search engine that just works. It's pretty stable now and quite a few people use it and like it: https://github.com/typesense/typesense

Would love to hear your feedback.

> I'd probably go with Postgresql FTS if possible. Sphinx/Solr for anything with indices smaller than a couple 100GBs. After that, ES seems reasonable & worth the overhead

why do you think you can't throw 1TB of data on postgresql?

Could postgres FTS handle millions of documents within a reasonable timeframe?

Yes, it's searching against the tsvector data type, which can be indexed.

The problem with PG FTS is that it doesn't have advanced search functionalities (fuzzy matching, faceting, term distance, highlighting results) and it lacks the modern relevance scoring systems so that'll be the limiting factor instead of speed.

At Sematext we help companies with Apache Solr and Elasticsearch. ES/ELK is definitely used more for time-series sorts of data. The Solr community puts more focus on full-text search (email search, product search, database search, etc.). Elasticsearch can do that, too, and we regularly help companies who use ES for that, but Solr seems more focused on that use case.

I work at a company that's a major player in online academic publishing.

We use Solr to power our main, end user-facing search after migrating from a custom Lucene solution some years ago.

To me it seems Lucene based tools are the best for the job if the main thing you care for is having text focused search with a huge potential for extensibility.

But there are a lot of use cases where you will never need anything more than the base capabilities of this technology (so you can be served by something simpler to use or maintain nowadays) and there are probably a lot of use cases where your search will be mainly driven by vector similarity (in which case you are working around the limitations of picking a technology with another focus).

As far as jobs go, I'm not sure how in demand specialists are. After a few years of working in the field I had a look to see if I could leverage my experience to get a remote position and came up with pretty much nothing.

The Sitecore CMS moved from Lucene to Solr for search for on-prem instances. After trying to run it on Windows we were happy there was a third party provider that was easy to work with.

For Sitecore I will shamelessly plug Coveo for Sitecore: https://www.coveo.com/en/solutions/coveo-for-sitecore

I think that we definitely have the best integrated and most full featured solution for Sitecore customers.

It's not only about querying but also having a UI framework, built-in customizable indexing, analytics tracking and access to machine learning available in one package.

Source: Am product manager for Coveo for Sitecore.

Based upon community responses (and someone from the organization being in the Slack group and responding as well), we ended up going with SearchStax.

I work at a state university (with the associated purchasing ... issues :)) and we had some staffing issues, which they were very accommodating of from initial quote to subscription.

If we end up running into issues as/if we expand our usage (we're essentially only using Solr for the mandatory bits) I'll keep Coveo in mind. :)

We use solr/lucene. It provides search and indexing for our CMS. It will in all likelihood be retired when we change CMS, it's scheduled for autumn next year (yeah right).

Yes, we use it as part of an ecommerce framework to index products, categories and CMS content for various customers. However, we don't have a specific Solr position, as we only need to make small adaptations which an average developer can usually do/infer from existing code.

ES is backing the product search at ecommerce site https://www.imusic.dk/. Even with 16M documents, a fairly intricate ranking function, and spelling suggestions, latency is in the order of 200 ms.

Lucene is still great today for smaller indexes that can entirely fit in memory and can be indexed quickly on app startup. Think something like searching for a setting in Windows 10 settings, or if you had some other fixed, small data set that you wanted to allow users to do real text search without the complexity of a search service. Lucene is still helpful here because of the analyzers, stemming, etc.

But for searching data that can grow and change over time, it's hard to justify using Lucene directly anymore. Azure Search (I believe built on Lucene) is an awesome (but relatively expensive) SaaS solution that is far easier to manage than Elasticsearch.

Search built using Postgres is underrated. It can do a lot if used properly.

They are on different levels: a search on a SQL database will have real-time results, while on Solr / Elasticsearch there is going to be a delay (from milliseconds to minutes). That delay makes it possible to build a series of data structures much better suited for search than the ones in a database.

I built several search systems for classified listing sites, and something like Solr is a life saver once you get enough traffic. It is much easier to scale than a SQL database, and you can do many more things. The easiest example is a facet: say you run a search on a car listing site and want to show how many matching cars there are for each brand. With SQL you have to make another query, while in Solr you can get those results in the same query. Now add the model, the color, the gas type, the transmission, the place, etc. With SQL that quickly grows into something unscalable, while with Solr it is easy.
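The facet idea can be illustrated with a small in-memory sketch: one pass over the matches yields both the hits and the per-field counts, which is conceptually what a faceted query returns. This is plain Python for illustration, not Solr's implementation, and the listings are made up.

```python
from collections import Counter

# Hypothetical car listings.
listings = [
    {"brand": "Toyota", "color": "red", "title": "Toyota Corolla hatchback"},
    {"brand": "Toyota", "color": "blue", "title": "Toyota Yaris hatchback"},
    {"brand": "Ford", "color": "red", "title": "Ford Focus hatchback"},
    {"brand": "Ford", "color": "red", "title": "Ford Ranger pickup"},
]

def search_with_facets(docs, term, facet_fields):
    """Return matching docs plus per-field value counts, computed in one pass."""
    hits = []
    facets = {field: Counter() for field in facet_fields}
    for doc in docs:
        if term.lower() in doc["title"].lower():
            hits.append(doc)
            for field in facet_fields:
                facets[field][doc[field]] += 1
    return hits, facets

hits, facets = search_with_facets(listings, "hatchback", ["brand", "color"])
print(len(hits), dict(facets["brand"]), dict(facets["color"]))
```

Adding another facet (model, transmission, place) only adds a field name to the list instead of another query.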

ES search has no delay once the data is actually committed to the index. That's also true for any general DB - Postgres etc.

We use Solr heavily at https://www.helpscout.com/, it really powers a ton of our functionality.

I'm still using an older version of Sphinx. I love it. It's fast, moderately flexible, very lightweight, easy to set up and produces good enough results. I have also found it to be highly reliable (at least the version I've been using for years). It's not useful for anything that needs hyper scale (Twitter et al), however for the next tier of scale below that it generally does well if you know how to leverage its strengths.

We use ElasticSearch at Lawmatics, and it powers more functionality than just our search! We use it to power our Automation targeting engine, reporting features, audience builder and pagination, filtering, sorting of data tables.

We denormalize associated records into one Index. And any record that we need to find based on user-defined queries will go through ES since it's much simpler to metaprogram queries across denormalized data (no conditional joins).

This is one of the undersung benefits of ES in my eyes. Relevant results require tuning of the indices and queries, and in most cases (those that non-IR experts would program) ES will give results as good as Solr's and be as easy or easier to implement.

But after you've gotten over that, you realize that this new tool can do lots more things than just text search. Time series metrics, BI, predictive ML, APM, etc. with relatively little work. With Solr, you could do those non-IR tasks, but it's going to feel much more awkward, IMO.

We use ElasticSearch extensively at our company but we don't use it for full text search (in fact, we don't use its full text capabilities at all) but rather for its ability to match and aggregate large data sets without having to create any indexes at all prior (and it's fast, it still blows my mind a little). This allows us to offer customers a way for them to create arbitrary queries in our own little DSL.

My experience is mostly Japan-centric nowadays but SOLR is very widely used here and there is demand for people with that background. A lot of work has been done with SOLR to better support the intricacies of dealing with Japanese text which differs substantially from other languages. Most of the search and NLP jobs I've seen recently outside of Google and Amazon expect some SOLR experience.

This has been a great thread, and there's some heavyweight indexes here. But what about at the other end of the scale?

Say when you've got 10k-50k contact details (name, email, phone) and you want to provide a quick, autocomplete lookup. I've used basic SQL string matching for this, but it doesn't catch mis-spellings and the rest.

Running SOLR or ES is overkill for this. Is there a tool that fits this niche?

Postgres does lexemes and all that jazz natively.

You want to be looking for tsvector related stuff.

I used it a few years ago to do full text search on a smallish website and it was great, I gather it has improved further since.

Yes! Please take a look at an open source search engine I am working on. You will definitely like it:


Would love to hear your feedback :)

What's wrong with running Solr/ES? It is trivial to run either in standalone mode, and it is a lot easier to set up autocomplete with misspelling support than messing with PG. Algolia is a good option if you have the budget.

> What's wrong with running Solr/ES?

With this small quantity of data, usually the app's running on a small VM. I'm wary of running anything Java, having had it require large amounts of RAM before.

That said, I haven't touched JVM stuff for 5+ years.

Lucene should work great for this use case. It has been a while, but I have successfully used it for exactly this.

With PostgreSQL you can use pg_trgm. It might not be as powerful as what SOLR/ES provides, but it is easier to run.

Upvoted. This is how Postgres supports "fuzzy searching" which helps with misspellings. https://www.rdegges.com/2013/easy-fuzzy-text-searching-with-...

Completion response-time will be slower than Solr, Elasticsearch, Algolia, etc... but if you're already running Postgres, this may be the fastest to deliver for you.
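For anyone wondering what pg_trgm is doing under the hood, here is a rough Python sketch of trigram similarity. It approximates the documented behaviour (lowercase, pad each word with two leading spaces and one trailing space, compare sets of 3-character substrings) but it is ASCII-only and not the extension's actual code. The function names are made up, and 0.3 is used because it is pg_trgm's default similarity threshold.

```python
import re


def trigrams(text):
    """Approximate pg_trgm trigram extraction (ASCII-only simplification)."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two leading spaces, one trailing space
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams in either string."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def fuzzy_lookup(query, names, threshold=0.3):
    """Return names whose similarity to the query meets the threshold, best first."""
    scored = sorted(((similarity(query, name), name) for name in names), reverse=True)
    return [name for score, name in scored if score >= threshold]


# A transposed-letter typo still finds the right contact.
print(fuzzy_lookup("jonh smith", ["John Smith", "Jane Doe"]))
```

For 10k-50k contacts, even this brute-force scan may be tolerable; in Postgres itself a trigram index avoids scanning every row.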

This is a great niche for Algolia, instant fuzzy 'autocomplete' style search of short strings

While I was at Eventbrite we were using Solr and started moving to Elasticsearch. I know one of the main people I worked with on that recently left for Github, which also uses Elasticsearch.

At Mozilla I work on one project with a search component (https://crash-stats.mozilla.org/), and it uses Elasticsearch.


ElasticSearch is widely used in enterprises for full text searching.

Wikimedia uses ES and you can download their entire index for any of their sites wikipedia/travel/quotes etc.

Yes. Also, just recently MongoDB v4.2 added Lucene as an embedded engine for text search capabilities

Do you know if it is coming to the community edition or is it only for Atlas?

We use ES for searching data within our application. We store about 20,000,000 rows of data in the primary table, with plenty of dependent and secondary tables. ES takes the load off our MySQL cluster for generating reports and fulfilling searches.

The company I'm working for is using Solr. We collect the data either by paying for it, getting it for free, or using crawlers. Then we query the data from Solr if it is not metadata. There are other types of databases that we use too.

ES is somewhat popular in the enterprise WordPress space, driven largely by 10up's ElasticPress plugin which uses it both for search and to improve the performance of database queries over MySQL.

Atilika (https://www.atilika.com/en/) uses Lucene/Solr for their NLP based search products.

I'm using SOLR for public facing search API/engine on several projects (own and customers). ES is imho better for doing various on-demand analytics (like dev logs search)

ES is used in code search tool dxr which we locally use as well: http://dxr.mozilla.org

Data discovery: this is the key concept. They are absolutely unbeatable at this, and because of it you can use them for a lot of things. http://siren.io

I guess more and more people are turning towards search engines directly integrated into databases, like ArangoSearch of ArangoDB to combine search with other needs https://www.arangodb.com/why-arangodb/full-text-search-engin...

I assume you work for ArangoDB?

Looking at his/her post history, he/she does but rarely discloses it.

Yeah, I work pretty much daily with Solr and related stuff (PHP, JSON). Still used quite a lot in PHP and Drupal scene.

Yup. Lots of Drupal sites use Solr. There are very good contributed modules that make using Solr with Drupal a doddle.

Same here - using SOLR with Drupal and it's pretty simple and effective

Yes, I work on the search team at my work and we extensively use Elasticsearch for many different search services.

Using SOLR heavily. I have a strong preference for SOLR when it comes to Full Text and ELK when it comes to logs.

I've seen elastic search used as a document indexing engine and thus also as search engine.

Also Solr, although a lot less.

Are people still using Sphinx Search (http://sphinxsearch.com) at all? It doesn't seem like it gets many releases anymore...since they unpublished the source code, it's hard to see how much activity there is.

https://manticoresearch.com/ is the lively, open source fork of Sphinxsearch. That's where some of the earlier developers from the project moved to. It's used as a text-search backend on craigslist.

Definitely. I love me some manticore. :-)

this is cool! Definitely will give it a look.

I'm using a 2.x version of Sphinx and have been for many years. I refuse to upgrade the version. I haven't been able to break/crash the version I'm using under nearly any common circumstances or loads, so I'm sticking with it until I find an alternative that is dramatically better. I've learned every nuance of it over time and can make it sing & dance exactly how I want it to. I consider it a spectacular piece of software; it does a thing and does it well, reliably.

We just migrated from Sphinx to using the full text search indexes in PostgreSQL, we had to deal with some changes in how special characters are handled, but it's worked well enough.

As far as I remember, a few years ago they didn't have BM25 or even TF-IDF support. Have they added that? Are you experiencing any issues with full-text search quality after migrating from Sphinx? (You probably used BM15(+F), which is BM25 w/o doc length.)
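For readers who haven't met the BM15 vs BM25 distinction: a small Python sketch of a single-term BM25 score, where setting `b=0` switches off document-length normalization and gives the BM15 behaviour. This uses one common IDF variant and typical default parameters (`k1=1.2`, `b=0.75`); it is an illustration of the formula, not the scoring code of Sphinx or PostgreSQL.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score one query term against one document.

    tf: term frequency in the document
    doc_len / avg_doc_len: this document's length vs. the collection average
    n_docs / doc_freq: collection size and number of documents containing the term
    b: strength of length normalization (b=0 ignores document length, i.e. BM15)
    """
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


# Same term frequency in a short and a long document.
short_doc = bm25_term_score(tf=3, doc_len=50, avg_doc_len=100, n_docs=1000, doc_freq=10)
long_doc = bm25_term_score(tf=3, doc_len=500, avg_doc_len=100, n_docs=1000, doc_freq=10)
print(short_doc > long_doc)  # BM25 favours the shorter document

bm15_short = bm25_term_score(3, 50, 100, 1000, 10, b=0.0)
bm15_long = bm25_term_score(3, 500, 100, 1000, 10, b=0.0)
print(bm15_short == bm15_long)  # with b=0, length no longer matters
```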

Do you have any numbers on your requests or searches per second in a real use case? I've really been wondering this as I've been considering manticore which is the major sphinx fork.

If you need high query rates, I suspect manticore will stand up quite nicely.

We use it on https://dealbert.net

I see you don't have autocomplete in the search box. You might be interested in this interactive course https://play.manticoresearch.com/simpleautocomplete/ It's about Manticore, but may be used with Sphinx too.

We use ElasticSearch version 1.7 at postjobfree.com

The reason to use version 1.7 is faster percolation (reverse search).
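In case "reverse search" is unfamiliar: instead of running one query against many stored documents, you store many queries and check each incoming document against them. A toy Python sketch of the idea follows, assuming each saved search is just a list of required words. The names are hypothetical and this is not the Elasticsearch percolator API.

```python
def percolate(saved_searches, document_text):
    """Return ids of saved searches whose terms all occur in the document.

    saved_searches: dict mapping a search id to a list of required words.
    Tokenization is a plain whitespace split, so punctuation stuck to a
    word will prevent a match; a real analyzer would handle that.
    """
    words = set(document_text.lower().split())
    return sorted(
        search_id
        for search_id, terms in saved_searches.items()
        if terms and all(term.lower() in words for term in terms)
    )


# Made-up job alerts: which users should be told about a new posting?
alerts = {
    "alice": ["python", "remote"],
    "bob": ["java"],
}
print(percolate(alerts, "Senior Python developer wanted, remote OK"))
```

A job site would run this whenever a new posting arrives to decide which alert subscribers to notify.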

MongoDB just launched beta support for lucene index based search native to its MongoDB Atlas platform.

Still using Lucene here searching through short text documents in a Java based server product.

ES has far more uses than devops. The most common use case I've seen is tagged document searching.

Yes, quite a few of my clients use ElasticSearch extensively outside of the ELK stack.

Data point: We're actively working on migrating from Autonomy to Elastic Search.

Nobody seems to have mentioned that GitHub uses ElasticSearch.


Does anyone have any production experience with Yahoo Vespa? I've heard it competes with ES in open source search.


Xapian. C++ based. I really like the API.

I came looking for a Xapian comment. It is HIGHLY underrated.

I love the "just works" functionality and portability.

At a previous job we had two existing options for search: Postgres GIN or the heavyweight ES cluster with Kafka. When I recommended grabbing Xapian for simple indexing (2-4k records, and tossing the index whenever we needed an update was OK), no one would bite.

Making use of the ELK stack currently

A lot of ElasticSearch is in use at my work for search feature work.

We are moving from FAST to ES. Major pain.

SOLR for the full-text win.

Lucene is the carry.
