Elastic, Loki and SigNoz – A Perf Benchmark of Open-Source Logging Platforms (github.com/signoz)
121 points by pranay01 on Jan 24, 2023 | 72 comments



I'm not a fan of competitors creating benchmarks like this: when faced with any tuning decision, they will usually pick the one that makes their competitors slower. But anyway, let's take a look at how they tuned Elasticsearch.

Disclaimer: I used to work at Elastic!

- Used Logstash instead of Beats for the simple task of reading syslog JSON data. Beats (https://www.elastic.co/guide/en/beats/filebeat/current/fileb...) would have performed better, especially around resource usage.

- Set a very low Logstash heap of 256 MB: https://github.com/SigNoz/logs-benchmark/blob/0b2451e6108d8f...

- Added a grok processor (https://github.com/SigNoz/logs-benchmark/blob/0b2451e6108d8f...). Dissect is faster here.

- No index template configuration. This causes higher disk usage than needed due to duplicate mappings. Again, a Logstash vs Beats thing. For this test, more primary shards and a larger refresh interval would also improve things (see the sketch after this list).

- The graph complains about Elasticsearch using 60% of available memory. This is as configured; they could use less without much impact on performance.

- Document counts do not match. This is probably due to using syslog with randomly generated data rather than creating a test dataset on disk and reading the same data into all platforms.

- Aggregation queries were not provided in the repo (https://github.com/SigNoz/logs-benchmark), so I cannot validate them.
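For reference, the kind of index template I mean looks roughly like this; field names, shard count and refresh interval are illustrative, not what the benchmark necessarily needed:

    PUT _index_template/logs-benchmark
    {
      "index_patterns": ["logs-*"],
      "template": {
        "settings": {
          "number_of_shards": 4,
          "number_of_replicas": 0,
          "refresh_interval": "30s"
        },
        "mappings": {
          "properties": {
            "@timestamp": { "type": "date" },
            "host":       { "type": "keyword" },
            "severity":   { "type": "keyword" },
            "message":    { "type": "text" }
          }
        }
      }
    }

Explicit mappings avoid the duplicate text + keyword fields you get from dynamic mapping, and the shard count / refresh interval trade a little search freshness for ingest throughput.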

I'm actually surprised Elastic did so well in this benchmark given the misconfiguration.


Thanks for the note. Our approach for this benchmark was to use the default configs that each of the logging platforms comes with.

This is also because we are not experts in Elastic or Loki, so we wouldn't know the possible impact of tuning configs. To be fair, we also didn't tune SigNoz for this specific data or test scenario; we ran it with default settings.

> The graph complains about Elasticsearch using 60% of available memory. This is as configured; they could use less without much impact on performance.

This is something we discussed, and we have added a note in the benchmark blog as well. Pasting it again for reference:

> For this benchmark for Elasticsearch, we kept the default recommended heap size memory of 50% of available memory (as shown in Elastic docs here). This determines caching capabilities and hence the query performance.

We could have tried tinkering with different heap sizes (as a % of total memory), but that would impact query performance, so we kept the default Elastic recommendation.
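For anyone who wants to experiment with it, the heap is set via jvm.options (or the ES_JAVA_OPTS environment variable); the sizes below are purely illustrative, e.g. for a 16 GB node:

    # config/jvm.options.d/heap.options
    -Xms8g
    -Xmx8g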


Part of the issue is that Elasticsearch isn't an open-source logging platform--it's a search-oriented database. Using it effectively as a logging platform depends heavily on configuration, compared to tools that are optimized only for logs out of the box.

I imagine you'd have similar issues with Postgres or any general purpose datastore without the correct configuration.


I'm not an Elastic expert either, just a developer responsible for a lot of things that can Google pretty good, and I knew those configs seemed off. I've been hearing for years that Beats is preferable over Logstash. I don't even claim to work in the logging space :-)


Unfortunately, the benchmark severely misunderstands how Grafana Loki should be queried for high-cardinality data. See also https://github.com/SigNoz/logs-benchmark/issues/1


Thanks for creating the issue. Yeah, this is what we also found: that Loki is not designed for querying high-cardinality data.

But since Loki is often used in observability use cases, where there is sometimes a need to query high-cardinality data, we thought we'd include it.


That's incorrect; Loki is designed for querying high-cardinality data.

The difference is that in Loki the index is only used for metadata around the source of the log lines (environment, team, cluster, host, pod etc) for selecting the right log stream to search in.

Parsing, aggregation and/or filtering of log lines on high cardinality data is all done at query time using LogQL. See also https://www.youtube.com/watch?v=UiiZ463lcVA and this live example where a 95th quantile is calculated using the request_time field of nginx logs https://play.grafana.org/d/T512JVH7z/loki-nginx-service-mesh...
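For example (the labels here are just illustrative), a LogQL query in that style selects streams by label and then parses and unwraps the high-cardinality request_time field entirely at query time:

    quantile_over_time(0.95,
      {cluster="prod", job="nginx"} | json | __error__ = "" | unwrap request_time [5m]
    ) by (host)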


This is kind of the issue with an interested party/vendor running benchmarks like these. Be it by pure dumb luck or malfeasance, you are much more likely to know how to configure your own product than the others, and to put out responses and results that are wildly inaccurate/misleading.


> While ELK was better at performing queries like COUNT, SigNoz is 13x faster than ELK for aggregate queries.

The author should also mention how much faster ES was than SigNoz at trace_id fetches (137x) and at fetching the first 100 logs (14x). Aggregation queries are a known pain point for ES and always will be, due to its design. People use additional tools for this, like Kafka Streams or Spark.

> ClickHouse provides various codecs and compression mechanism which is used by SigNoz for storing logs

What was "index.codec" for ES? Unfortunately, the default value does not provide the best compression ratio.

I won't say that ES (or OpenSearch) is perfect, but I was surprised it holds up well here, considering ES was run in (I'll presume) a non-optimal environment. First, use Kafka instead of Logstash (or in front of Logstash), and your ingestion rate will skyrocket. Second, learn how to tune the JVM.

Also, the author should use OpenSearch [1] because that is the place where all open-source development is happening now.

[1] https://opensearch.org


What about index mapping, how many primaries, how many replicas, index rollover? Is your hot tier optimised for ingest, your warm tier for querying, and your cold tier for storage? There's so much to think about to get Elasticsearch running "optimally" and to keep it that way.

It highlights the operational cost of running Elasticsearch.


Hi, I think this question is pointed towards Elasticsearch,

But here are some points for SigNoz (I am one of the maintainers at SigNoz):

> Directly ingesting to disk (hot tier) is faster than directly ingesting to S3 (cold storage).

> The query results were an average of cold + hot runs (for ELK as well). We didn't have an explicit concept of warm storage for SigNoz in our benchmark.

> The query perf for logs with cold storage is almost the same as with hot storage, but the operational cost is lower with cold storage. So ingesting to hot storage and moving to cold storage after a certain amount of time is a good option for SigNoz.
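For context, SigNoz stores logs in ClickHouse, and ClickHouse can move older parts to an S3-backed volume via a TTL rule. A rough sketch, assuming a storage policy named 'tiered' that contains an S3-backed volume named 'cold' (table, column and policy names are hypothetical):

    -- attach the tiered storage policy, then move parts older than 7 days to the S3 volume
    ALTER TABLE logs MODIFY SETTING storage_policy = 'tiered';
    ALTER TABLE logs MODIFY TTL toDateTime(timestamp) + INTERVAL 7 DAY TO VOLUME 'cold';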


Any tool handling large amounts of data has an operational cost.


Sure, but in the end it boils down to whether that operational cost is worth it for the value received. Tools like Loki are worthwhile alternatives for centralising infrastructure logs with lower operational costs.


ES isn't cheap to start with, and I agree with you on that, but it is straightforward to scale after you go above 3-6 nodes. ClickHouse is easy to start with (a single server), but not so easy to scale up or down unobtrusively.

> What about index mapping, how many primaries, how many replicas, index rollover? Is your hot tier optimised for ingest, your warm tier for querying, and your cold tier for storage?

Yes, there are details in this, but like every truly distributed system, you can't just plug it in and hope it works in the most optimal way. Also, regarding hot/cold storage, AFAIK, ES can do it after the fact, but with CH, you need to plan it in.

> Sure, but in the end it boils down to whether that operational cost is worth it for the value received. And as ELK is commonly used to centralise infrastructure logs, tools like Loki and SigNoz are becoming worthwhile alternatives.

Actually, it boils down to whether you plan to grow or not, and tools like SigNoz or Loki have their place for sure. For example, for centralized logging, if you have a few servers and keep it that way for the next N years, ELK might not be for you. But if you suddenly end up with 100 servers and an ML team that wants to drill through logs and other data to get more insight into everything, moving to ELK later will be way pricier than starting with it.


Thanks for the feedback. We chose Elasticsearch because, in our experience, it is still the default tool people use to get started with logs. But we do understand that OpenSearch may be catching up now.

> The author should also mention how much faster ES was than SigNoz at trace_id fetches (137x) and at fetching the first 100 logs (14x).

We didn't mention this in the summary because, at the scale we tested, the difference would not be perceived by a user. E.g., for getting logs corresponding to a trace_id (a high-cardinality field), SigNoz took 0.137s and Elastic took 0.001s.

I think we read somewhere (will try to find the source) that any server response below 200ms is not perceived by the user.


That sounds like an excuse. I do agree they’d likely both feel more or less instant though.


You can also find Elasticsearch vs Clickhouse performance benchmarks for log data on db-benchmarks.com [1]. The corresponding article is here [2]

[1] https://db-benchmarks.com/?cache=fast_avg&engines=clickhouse...

[2] https://db-benchmarks.com/test-logs10m/


This is very interesting, will go through it. Thanks for sharing.


Isn't Elastic NOT truly open source since 7.10?

You are using 8.4.3 and you are building on top of it. Have you checked the terms & conditions?

OpenSearch is an open source alternative

https://opensearch.org/ https://www.elastic.co/pricing/faq/licensing


I may have missed something, what have they done that builds on top of 8.4.3? SigNoz uses Clickhouse as its storage backend.


> You are using 8.4.3 and you are building on top of it. Have you checked the terms & conditions?

We have just used Elastic 8.4.3 for the benchmark. We are not building on top of it. So I am not sure how the T&Cs apply. Can you share more?


The Hacker News title says "open source" in it... but Elastic isn't open source, so it shouldn't qualify... though the article doesn't say open source... so maybe the Hacker News title was editorialized and is incorrect?


Free software is loaded with terms.

Open Source means concretely: that you can read the source.

Free and Open source usually means you are free to use it in many ways, sometimes with restrictions or agreements that contributions must be made available. This freedom is the point of contention with ElasticSearch and MongoDB.

Sorry to be pedantic, but terminology is important.


Yeah, no. This is HN and free software & open source are established terms of art in programming, and Elastic is neither free software nor open source software

Fortunately Elastic already has an appropriate term for it: source-available software [0], and that's the term Elastic itself uses to define its license in their FAQ [1]

[0] https://en.wikipedia.org/wiki/Source-available_software

[1] https://www.elastic.co/pricing/faq/licensing#what-is-sspl-an...


I saw that Cloudflare moved from Elastic to Clickhouse for logging.

https://blog.cloudflare.com/log-analytics-using-clickhouse/


Yes, and the perf improvements they have achieved are also staggering. In the presentation linked in the blog, they mention:

> CPU and memory consumption on the inserter side got reduced by 8 times.


Clickhouse go brrr basically.

Saw similar results on a hand-rolled version of this purely for logs. Nice to see an OSS solution that also bundles in the other bits of the observability stack into Clickhouse.


Thanks. Yeah, ClickHouse has quite good perf, especially for observability use cases where aggregate queries tend to dominate.

Here's a blog from Uber [1] where they had 70-80% aggregate queries in their production env and saw around a 50% improvement in the resources required.

From their blog `We reduced the hardware cost of the platform by more than half compared to the ELK stack`

[1] https://www.uber.com/en-IN/blog/logging/


The system I worked with was acquired by Uber but built independently of that solution; they were constructed -very- similarly. (I worked at Uber for a short time after it was acquired.)


What schema does SigNoz use with ClickHouse? The OpenTelemetry Collector uses this schema https://github.com/open-telemetry/opentelemetry-collector-co... and I found that accessing map attributes is much slower (10-50x) compared to regular columns. I expected some slowdown, but this is too much.


SigNoz also follows a similar approach, since the attributes can be arbitrary and ClickHouse needs a fixed schema ahead of time. The options are maps, paired arrays, etc., but they are all slow depending on the object unpacking ClickHouse needs to do. ClickHouse does its best on regular columns, as that's what it's built for. If the access is on Map/Array types, it is faster than other DB systems but slower than regular columns.
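If you query a handful of attributes all the time, one common workaround (not necessarily what SigNoz does internally) is to promote them out of the map into materialized columns. A sketch, assuming the attributes map column is called LogAttributes as in the OTel exporter schema; other names are hypothetical:

    -- only applies to newly written parts unless you also rewrite/materialize existing data
    ALTER TABLE otel_logs
        ADD COLUMN http_status UInt16
        MATERIALIZED toUInt16OrZero(LogAttributes['http.status_code']);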


Hmm. I think benchmarking this sort of thing well is pretty hard. I’ll note:

- These clusters are tiny (circa 4 machines). E.g., one thing that is difficult about evaluating alternatives to what we currently have is trying to guess how they would perform at a similar size. It is complicated when a comparison cluster requires rack space and millions of dollars of machines.

- There are lots of tuning options. If another system performs poorly, is that because you don't have many years of experience tuning it well?

- Perf will be quite sensitive to the shape of the data and queries. Here I got the impression that the data is very uniform: there are a small number of different fields, and most fields are set on each log line. If you have many different teams producing log lines with different fields, then dense representations won't work as well as they do for this format, for example.

The experience reports from Uber/cloudflare are useful. One worry is that it is much more common to write a blog post about switching to some new system than to write one about how the shiny new system turned out to have significant flaws that needed to be worked around.


Agreed on your points.

Performance benchmarks are not easy to execute. Each tool has nuances, and the testing environments must aim to provide a level playing field for all tools. We have tried our best to be transparent about the setup and configurations used in this performance benchmark.

I think you would need to test different solutions for your specific data and query patterns to understand better how different tools would fit.

But hopefully our results could give you pointers on where to poke


Curious how this compares to S3 + Athena (object + Presto or Apache Drill). ES always seems like a bit of a weird fit for logs since

- it's optimized for repeated searching (logs don't tend to be searched very often)

- it isn't optimized for aggregation (ad-hoc metrics)

- it has a fixed schema (logs can have one, but it requires effort)
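For context, the kind of setup I mean is just partitioned files on S3 with an external table over them; something roughly like this (bucket, fields and partitioning are hypothetical):

    -- Athena (Presto) DDL over Parquet files in S3
    CREATE EXTERNAL TABLE logs (
        ts      timestamp,
        level   string,
        service string,
        message string
    )
    PARTITIONED BY (dt string)
    STORED AS PARQUET
    LOCATION 's3://my-log-bucket/logs/';

    -- ad-hoc aggregation straight off object storage
    SELECT service, count(*) AS errors
    FROM logs
    WHERE dt = '2023-01-24' AND level = 'ERROR'
    GROUP BY service
    ORDER BY errors DESC;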


That's a good question. We have not evaluated this stack for logs yet.

Do you use the `S3 + Athena (object + Presto or Apache Drill)` stack currently for logs? What do you like about it?


Not currently, but I did in a different role. It's dirt cheap and almost infinitely scalable, and there isn't all the added complexity of running and tuning something like ES.


One thing that I would want to see in this kind of benchmark is the performance of these engines when data is stored on object storage.

The main reason is that the amount of logs, metrics, and traces data can be huge...

I think Loki was made to work on object storage.


+1. Loki is designed for object storage as its backend. Persisting all data on object storage (so no storage tiering) vs local storage gives you cost savings, increased durability, and simplified operations at scale.


ES works on object storage, too afaik but it's a paid feature


I am interested in the `dummyBytes` generated by flog [2] in the linked benchmark result [1]. It's random words from `/usr/share/dict/words`, which may not be highly compressible.

I.e., 500 GB -> 207 GB (with zstd data compression + indexes) seems like a worst-case scenario. With "real" logs, I expect this to be much better (for logs at least).

Does anyone have a similar size comparison with real-life examples? (Similar data size; interested in the compressed log size and index size with ClickHouse.)

[1] https://signoz.io/blog/logs-performance-benchmark [2] https://github.com/signoz/flog [3] https://github.com/tjarratt/babble/blob/cbca2a4833c1dd0e0287...
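If anyone wants to check this on their own ClickHouse install, the compressed vs uncompressed sizes per table are in system.parts, e.g.:

    SELECT
        table,
        formatReadableSize(sum(data_compressed_bytes))   AS compressed,
        formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
        round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio
    FROM system.parts
    WHERE active
    GROUP BY table;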


Yeah, agreed. This is a worst-case scenario. We also expect compression to be better on real-life data.


Disclaimer: I work for Lightrun, a dynamic instrumentation (read: add logs at runtime) tool.

Logging is a surprisingly pricey bit of observability at scale.

If you're doing anything highly-transactional you're pretty much guaranteed nowadays to get major observability bills (for things like ingestion, transmission, storage and analysis/querying).

It also creeps up on you - you're used to getting billed incessantly for cloud stuff, monitoring doesn't sound too pricey. Until, well, it is.

Can't help but point out that this is static logging, meaning logging added during development. This stems from the approach colloquially referred to as "log everything, analyze later", rather than more disciplined, case-specific logging.

We're used to adding "just-in-case" logs to ensure that we've got ourselves covered when the sh*t hits the fan, but we rarely look at them.

An alternative approach would be using a tool, like the one Lightrun[0] builds (see disclaimer above) to add Logs in real-time to applications.

This means that instead of logging IN ADVANCE (i.e. during development) you can add logs when and where you need them, in real-time. The tool works using a variety of dynamic instrumentation techniques (depends on the runtime), and currently supports Java, Node.js & Python (.NET soon).

It can also pipe these logs into Loki, Elastic or SigNoz since they can be dumped like normal logs, right into the stdout.

In any case, we're seeing mass reduction in static logging (up to 60% in volume, 40% in cost) by going dynamic instead of mostly static. You'll always have to log some stuff for forensics and traceability, but dynamic instrumentation removes a lot of the dead weight.

[0] https://lightrun.com


Can Lightrun dynamically add logging to an app in the past? If not, I don't really see the value of this.


Lightrun adds logs at runtime, which means you'll add a log and it will be integrated into the stream of logs emitted from the application (or streamed to your IDE plugin) - it's not a time-travelling debugger, if I understand the question correctly :)


Thanks for this! This is really cool, I have a few questions. Can Lightrun prevent certain fields from being logged? Some fields shouldn't be logged such as credit card numbers, secrets, username etc. How can we prevent devs from accessing things they are not supposed to?

Can this replace continuous profilers as well?

What's the reason this is not getting more adoption? AFAIK most companies are either eating the cost of logging or managing it with sampling. This seems way better than either of those options, so what's the downside?


Per your first question - protecting sensitive information - we've got PII redaction and Blocklisting that cover a variety of potential cases of data leakage [0].

Regarding profiling - we're not actively a profiler, but we do offer a set of code-level metrics you can use to do performance analysis and detect bottlenecks [1].

Regarding adoption - we're doing OK:) I think this is a new approach, one that is quite different than what developers are used to. But it's also one of those things that - as you mentioned - make so much more sense than the alternatives, and with costs rising it's really even more evident than ever. If you're interested in the cost of logging specifically we've got it broken down in a study [2] we recently released.

[0] https://docs.lightrun.com/data-security/

[1] https://docs.lightrun.com/metrics/

[2] https://lightrun.com/resources/lightruns-economic-impact-on-...


Isn't that equivalent to setting the debug level in the code and scaling in/out the amount of data as needed with a push of a button?


Well, what if there was no log there to begin with?

If you've got a bunch of logs labelled as INFO, WARN and ERROR in your system, and you tweak the debug level - you get more detailed debug logs as need be. This will stream EXISTING logs to stdout and then to your favorite APM.

However, if the exact bit of information you wanted isn't there (that variable isn't logged, that specific piece of code isn't instrumented so you can't know it was reached, etc...) you're stuck.

In addition, sometimes (often) it's hard to correlate the exact path the code took, since it's not clear which condition or class or package or server was actually involved in the process of execution. This tool enables you to conditionally log just what you need in real time - basically "paint a path" through the code at runtime.

Hope that's clearer, can elaborate more if need be.
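To make the general idea concrete (this is not how Lightrun is implemented, just a toy sketch in Python of attaching a log to a running process without editing its code; file and line are hypothetical):

    import sys

    TARGET_FILE = "checkout.py"   # hypothetical module we want to observe
    TARGET_LINE = 42              # hypothetical line to "attach" a log to

    def tracer(frame, event, arg):
        # called by the interpreter for call/line events; read-only peek at state
        if (event == "line"
                and frame.f_code.co_filename.endswith(TARGET_FILE)
                and frame.f_lineno == TARGET_LINE):
            print(f"[dynamic log] {TARGET_FILE}:{TARGET_LINE} locals={dict(frame.f_locals)}")
        return tracer

    sys.settrace(tracer)  # no change or redeploy of checkout.py itself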


Do you have SDKs or some client libraries for lightrun on GitHub that I can look at?


Got a free tier you can play around with at [0] , and a few examples you can check out over at [1].

We've also got a zero-config (in-browser!) version of the whole thing [2], using code-server [3].

[0] https://lightrun.com/free

[1] https://github.com/lightrun-platform/lightrun/tree/main/exam...

[2] https://playground.lightrun.com

[3] https://github.com/coder/code-server


Can you explain in more detail how Lightrun works? It sounds like monkey-patching in real-time from what I can gather, which is neat but surely comes with some kind of overhead.


Yeah, even if they are wrapping near-noop guards all over the AST/codebase as it's being compiled, and the secret sauce is just to tickle a guard to log at those specific points dynamically, you would think this would have huge overheads in the no-log state, as those guards still get hit in hot code blocks (kind of like running in a debugger env).

They note that this is patented -- I wonder how. Every way I can imagine this being implemented in the languages they support seems to be clearly prior, well-known art. Is the patent office granting more "RED -- we own compressing raw" type patents?

Looking at the patent, it seems like they are monkey-patching either dynamically or via AST-injected nodes at compile time. What the heck is the PO doing? Also, who in their right mind would open their prod services to an external party for code injection, lol.


Great questions from all ends. Let me try and clarify.

This is not monkey-patching or hot-swapping, but a different approach - we call it dynamic instrumentation. This amounts to having an agent (an SDK/library, in essence) perform the addition of Lightrun Logs, Snapshots and Metrics, then (potentially) pipe the information where it needs to go. We've got a nice diagram here [0].

I think that the mechanism itself - which changes by runtime, naturally - is well explained in the link above. However, the core security mechanisms we have are enabled by another component of the agent we dub the Sandbox.

In essence, we've got a (patented) way to verify everything that we do at runtime. That means we ensure that each evaluated expression does not have side effects (like changing the value of a member of an array, or editing a variable value), every metric we could think of is throttled and rate-limited (that includes our usage of CPU, RAM, the rate of I/O and a bunch of other things).

Given this sandbox mechanism, and the way the networking requirements look like (again, look at [0] - no need to open a debug port / inbound ports, and a pretty agnostic deployment model) I think we've got a pretty robust defense layer against a variety of failures. Also see my comment a few comments above regarding data security.

[0] https://docs.lightrun.com/more-about-lightrun/#how


Love it when people put out benchmarks like this (even when not necessarily independently performed).

Thank you pranay!


Thanks. Yeah, we were not able to find many benchmarks for log solutions, hence we tried to do our bit.

Performance benchmarks are not easy to execute. Each tool has nuances, but we have tried to be as transparent as possible about what we tried, and we are looking for feedback from the community here on how to get better.


Yeah honestly it’s really hard to do, and with every intermediate system there are tons of tunables and setup specific stuff.

The effort is commendable, and since y’all have provided code it’s much more reproducible


How does it compare to https://quickwit.io ?


also interested in this.


I'm working on such a benchmark. I'm scared of how difficult it is to not be biased, though.

(disclaimer: quickwit cofounder here)


Benchmarks are always "It depends".

And what it depends on is your data volume, how you want to query, whether you value ingestion more than query speed and timeliness, and so forth.

Elastic's sweet spot is that it indexes everything, and you can query fast as a result. But it does this at the cost of ingest: it does the work of building indexes during ingestion, so ingest is more CPU-intensive and can hit limits here. As a general-purpose workhorse, Elastic shines.

Loki's sweet spot is that it has very few indexes, so ingestion is cheap and extremely capable for huge data volumes. It does this at the cost of query performance over very large data sets: without indexes, queries brute-force via map-reduce, which means you really want to specify where to look (which log streams) and when to look (a time window), and within that it excels. For logs, Loki is heaven.

ClickHouse's sweet spot is that the indexes (columns) are explicitly configured by engineers who know what the data looks like and how they want to query it. Now the ingest cost is balanced and the query performance is great, but this comes at the cost of you knowing your data and how you're going to query it most of the time; it's not so good for esoteric questions you never anticipated (though you can get very far with some of its column data types, which let you stay reasonably flexible on this). For BI data, ClickHouse is incredible.
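To make that concrete, a minimal sketch of the kind of upfront schema decisions I mean for log data (columns, codecs and sort key are illustrative):

    CREATE TABLE logs
    (
        timestamp DateTime64(9)          CODEC(Delta, ZSTD(1)),
        service   LowCardinality(String),
        severity  LowCardinality(String),
        trace_id  String                 CODEC(ZSTD(1)),
        body      String                 CODEC(ZSTD(3))
    )
    ENGINE = MergeTree
    PARTITION BY toDate(timestamp)
    ORDER BY (service, severity, timestamp);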

They all have sweet spots, and a benchmark is not going to answer the real questions - what data volume do you have, what do you value (ingest and preserve everything vs fastest query speed for ad-hoc queries vs a balanced approach), do you know how you want to query the data, etc?

Other thoughts:

Loki has recently moved to TSDB for the backend storage; these benchmarks don't go there.

Elastic can use less disk if you configure synthetic _source (https://github.com/elastic/elasticsearch/issues/86603), which discards the raw byte copy of the ingested data, retains only what's in the indexes, and uses the indexes to reconstruct the source should you request it. Enabling this is only supported by ES for metric-type data (according to the docs), but it is possible to enable it for other types of data.
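If you want to try it, synthetic source is a mapping-level switch (index name is hypothetical; per the caveat above, the docs only bless it for metrics-style indices in 8.x):

    PUT metrics-example
    {
      "mappings": {
        "_source": { "mode": "synthetic" }
      }
    }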

Nothing to add about ClickHouse. I've used all three databases and worked against all three with huge volumes of data: if I want more OLAP-style querying than OLTP and I know my data, then today ClickHouse absolutely shines, with the others playing catch-up. Elastic and Loki shine far more for OLTP workloads, so it's a trade-off again (though Elastic does a good job of handling more OLAP cases than Loki does today due to it having column storage, and Loki wins on being able to ingest more cheaply, meaning being able to ingest more at the same or lower cost).

What you value and what you want to do is up to you.


Thanks for the note. Makes a lot of sense.

> if I want more OLAP-style querying than OLTP and I know my data, then today ClickHouse absolutely shines, with the others playing catch-up

Curious, what type of OLAP workloads did you work with? Were there cases where you tried both CH and Elastic for the same use case and decided on one over the other?


Again - "it depends".

The decisions I've seen firsthand (or been a part of):

CH vs ES: in petabyte-sized stores of highly homogenised data (very few, well-structured types of data, i.e. "HTTP requests" where all logs precisely followed the same format, blended with business data (i.e. the HTTP logs contained info on customer, tenant, etc.)), where the primary goal was business insight and operator analytics, ClickHouse was the clear winner (able to ingest at this scale whilst providing query performance at this scale). This was Cloudflare, for the HTTP logs, L7 firewall logs and the L3 logs (sampled packet headers). They've blogged about it enough that there's no secrecy that they are using CH for this.

Loki vs CH: in petabyte-sized stores of logs from tens to hundreds of thousands of machines, where every log is structured differently across apps, OSes and devices, and the primary goal was to store all logs and provide operator observability, Loki was the clear winner (capability to ingest, control over storage costs).

There is no one-size-fits-all, and benchmarks are not useful in answering the questions that should be asked. Engineers should go back to first principles and evaluate systems and solutions according to their specific criteria. Some may even end up on an entirely different solution than the ones you've outlined; PostgreSQL is pretty phenomenal too, and its full-text search is great.


ES should be rewritten in Rust. Database engines created in GCed languages make no sense IMO. Furthermore, Java is probably the worst choice of the GC big 3 (Java, C# and Go).


That's an interesting idea. Curious if there are DB engines written in Rust today that people use at scale?


InfluxDB's newest storage engine, optimized for time-series data, is built with Rust (https://www.influxdata.com/blog/influxdb-engine/)


https://datalust.co/seq (my employer) uses a custom database built with Rust. We are about to remove the last C# database code because, as the previous commenter noted, garbage collection and databases don't mix.

Over the next few years I expect we will see a lot of new databases written in Rust.

https://blog.datalust.co/what-will-seq-vnext-look-like-on-th...


Here it is: http://github.com/quickwit-oss/quickwit

Quickwit targets big and append-only data use cases, log search and traces in particular.


Keeping Vector out of the benchmark game shows that SigNoz couldn't beat it.

https://github.com/vectordotdev/vector


Vector is not a database/data store and has nothing to do with this benchmark.


How does that compare? Vector is used for building pipelines, but this original post talks about the backend storage comparison.


Be careful with benchmarks like this. From having worked a lot with Elasticsearch, I know it requires a bit of tuning and planning to get the most out of it and that that is a bit of a dark art for new users. Also, it is very good at querying and aggregation queries and doing that at extreme scale. So statements that X is Y times faster/slower than Z don't mean a lot. How much data are we talking? What kind of mappings, how many nodes, etc. These are apples and oranges comparisons unless you specify this in detail.

Several things you might want to look into when using Elasticsearch/Opensearch:

- Data streams and index lifecycle management policies are crucial for time series data. Basically, with time series data you are mostly querying recent data, so you can benefit from limiting your indices to a size where things fit in memory. Like with any database index, you need that for your queries to perform. OpenSearch has a competing feature that does the same thing. Basically, with either feature you can ensure your data size does not grow beyond what your cluster can handle. Like with any database, it doesn't scale to infinity without doing something on the hardware side.

- Likewise, your mapping matters. Mistake #1 that I see companies make over and over again is to have dynamic mapping turned on in a logging cluster, so that every string field is indexed both as text and as a keyword. Don't do that. If you are not going to query on it, don't map it. Each additional mapped field consumes resources (memory, disk, IO, CPU). Elasticsearch and OpenSearch have index templates and rule-based field mappings that you can set up for this (see the sketch after this list).

- Set up your cluster in a sane way. If you are doing tiny amounts of data (a few tens of GB or less), a bog-standard 2- or 3-node cluster is fine. I run a couple of logging clusters on essentially the cheapest Elastic Cloud setup that is allowed, think less than $50-60 per month. It's fine, but it obviously won't scale endlessly. Otherwise, you might want specialized nodes, master nodes, ingest processors, query nodes, elastic scaling, etc. This is a good reason to use Elastic Cloud: it comes with some sane, easy-to-manage options for this and you can easily scale your cluster. If you are not an expert, you will save a lot of time by not doing it wrong.

- Ingest performance heavily depends on your mapping and how much heavy lifting you make Elasticsearch do at index time vs. how much you do before you send data to elasticsearch. A good ETL pipeline can offload a lot of the overhead to elsewhere. It's completely meaningless to even talk about ingest performance until you've done your homework on that front. Apples and oranges basically.

- Elasticsearch uses a lot of internal caching and off-heap memory. That's why they recommend using at most 50% of memory for the Java heap. The other 50% is used for loading blocks of disk into memory and file caching. If that runs out, you are looking at a lot of unnecessary disk IO when you query. So look at your disk usage, shard allocation, and RAM allocation. If those don't line up, things are going to slow down.
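As mentioned in the mapping point above, here is a sketch of an index template that turns off the default text + keyword duplication by mapping dynamic strings to keyword only (names are illustrative):

    PUT _index_template/logs-keyword-only
    {
      "index_patterns": ["logs-*"],
      "template": {
        "mappings": {
          "dynamic_templates": [
            {
              "strings_as_keyword": {
                "match_mapping_type": "string",
                "mapping": { "type": "keyword", "ignore_above": 1024 }
              }
            }
          ]
        }
      }
    }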

Nothing against other solutions. IMHO a valid concern with ES is the level of complexity involved with using it properly. Many devops teams are just looking for simple but limited turn key solutions and there are some good options in the market for that. Especially if you don't have data or search specialists on your team, it's worth starting with those.


Thanks for the note, appreciate it. Will look into the points you have highlighted and learn from them (we are in no way experts in Elasticsearch). For this benchmark, we ran all the tools with default settings and compared how they did.

Performance benchmarks are not easy to execute. We understand each tool has nuances, and the testing environments must aim to provide a level playing field for all tools. We have tried our best to be transparent about the setup and configurations used in this performance benchmark, and we look forward to learning from the community here.

Would also like to share blogs from Uber [1] and Cloudflare [2], who recently migrated from Elastic to ClickHouse for logs; they may have more details on the points you are making above.

[1] https://www.uber.com/en-IN/blog/logging/ [2] https://blog.cloudflare.com/log-analytics-using-clickhouse/



