Show HN: Quickwit – OSS Alternative to Elasticsearch, Splunk, Datadog (github.com/quickwit-oss)
145 points by francoismassot 10 months ago | 51 comments
Hi folks, Quickwit cofounder here.

We started Quickwit 3 years ago with a POC, "Searching the web for under $1000/month" (see HN discussions [0]), with the goal of making a robust OSS alternative to Elasticsearch / Splunk / Datadog.

We have reached a significant milestone with our latest release (0.7) [1], as we have witnessed users of the nightly version of Quickwit deploy clusters with hundreds of nodes, ingest hundreds of terabytes of data daily, and enjoy considerable cost savings.

To give you a concrete example, one company ingesting hundreds of terabytes of logs daily is migrating from Elasticsearch to Quickwit. They cut their compute costs by 5x and their storage costs by 2x while increasing retention from 3 to 30 days. They also improved durability and elasticity, and gained exactly-once semantics (and thus better accuracy) thanks to the native Kafka support.

The 0.7 release also brings better integrations with the observability ecosystem: improvements to the Elasticsearch-compatible API and better support for the OpenTelemetry standards, Grafana, and Jaeger.

Of course, we still have a lot of work to do before Quickwit is a fully-fledged observability engine, and we would love to get feedback or suggestions.

To give you a glimpse of our 2024 roadmap, we plan to focus on Kibana/OpenDashboard integration, metrics support, and a pipe-based query language.

[0] Searching the web for under $1000/month: https://news.ycombinator.com/item?id=27074481

[1] Release blog post: https://quickwit.io/blog/quickwit-0.7

[2] Open Source Repo: https://github.com/quickwit-oss/quickwit

[3] Home Page: https://quickwit.io




> To give you a concrete example, one company ingesting hundreds of terabytes of logs daily is migrating from Elasticsearch to Quickwit. They cut their compute costs by 5x and their storage costs by 2x while increasing retention from 3 to 30 days

I guess that's to be expected. Almost anything is more storage-efficient than Elasticsearch; FTS is so expensive.


Quickwit is FTS too though. I think the difference comes from the fact that they stored everything on EBS while Quickwit stores its index on S3.


Looking at the docker compose, this seems like a very complicated tool to run.

You'd need Kafka, ZooKeeper, and Jaeger, all of which would need to be HA, plus this service itself. Not mentioning Postgres because in theory you can use Aurora or the like.

How quickly have your current customers been able to get up and running so far? And how much maintenance have they needed?


Most users do not use Kafka/Zookeeper. The only external service for them is an S3 bucket. They then use the push API. It is perfectly fine if you have only a couple of TB a day.
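
For reference, the push API is just an HTTP endpoint that takes newline-delimited JSON. A minimal sketch (index name, port, and file are illustrative):

    # push NDJSON documents into an existing index
    curl -XPOST "http://127.0.0.1:7280/api/v1/my-logs/ingest" \
         --data-binary @logs.ndjson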

For the crazy large use cases, like the one described in the blog post, Kafka becomes necessary. At that scale, our users usually already have their data in Kafka or Redpanda and are actually happy to get native integration:

- their data does not need to be copied/replicated into a WAL "again"

- we get exactly-once semantics

Also, in 0.8, we will be adding proper support for distributed ingest. The feature is actually already implemented and was originally scheduled for 0.7, but we preferred to test it more before shipping it.


Quickwit supports different data sources: Kafka, Pulsar, Google Pub/Sub, ... and we have our own ingest API (not HA right now, but it will be in the next release, in a month or so).

PostgreSQL is not mandatory; it's also possible to use Quickwit with a metastore on S3. For large use cases, PostgreSQL is the way to go. I've seen users running Quickwit with the metastore on S3, RDS, and Aurora.
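
For context, the metastore backend is just a URI in the node config. A minimal sketch (bucket name and credentials are placeholders):

    # quickwit.yaml
    # file-backed metastore living on the same bucket as the indexes:
    metastore_uri: s3://my-quickwit-bucket/indexes
    default_index_root_uri: s3://my-quickwit-bucket/indexes

    # or, for large deployments, a PostgreSQL metastore:
    # metastore_uri: postgres://quickwit:change-me@db-host:5432/quickwit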

On the UI side, we have several users who have their own UI. Jaeger is used just for the UI part, so it's quite simple to run it in HA. I don't think it's hard to run Grafana in HA either, but I'm not sure on this point.

Which docker compose did you look at?


According to the documentation [1], Kafka is just one of the supported inputs for ingestion, so it should be possible to run Quickwit without it if you're not intending to write logs into Kafka. Jaeger also seems to be an optional dependency. The same probably goes for Zookeeper?

1: https://quickwit.io/docs/ingest-data/kafka
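
For what it's worth, per those docs, attaching a Kafka source looks roughly like this (topic, broker, and index name are placeholders):

    # kafka-source.yaml
    version: 0.7
    source_id: my-kafka-source
    source_type: kafka
    params:
      topic: app-logs
      client_params:
        bootstrap.servers: kafka-broker:9092

    # then register it on an existing index:
    quickwit source create --index app-logs --source-config kafka-source.yaml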


Zookeeper is not used by Quickwit at all. It is used by Kafka however.


I continue to see that dependency almost everywhere I see Kafka used, and yet KIP-500 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-500%3A...> and its Jira <https://issues.apache.org/jira/browse/KAFKA-9119> are shown as "Resolved Apr-2021" allegedly in 2.8.0

I wonder if it's going to be the "python2.7" (or I guess more relevant "Java8") of running Kafka :-(


What I am really missing in these (really nice) alternatives to ELK is Kibana and its Lens.

I tried to tackle this with Grafana but never managed to build graphs as easily as with Kibana. Maybe I was not trying hard enough?

Has anyone replaced Kibana with Grafana for non-time-based graphs?


Good point. Several users are asking us for OpenDashboard/Kibana compatibility, and it is on the 2024 roadmap.

That being said, we also hear from users complaining about OpenDashboard/Kibana who are looking for an alternative to the Kibana/Grafana Explore view (the view used for log and trace search). You will also find users who are satisfied with the Grafana Explore view.

Personally, I don't find the Grafana Explore view great for log searches. I saw that Grafana recently made some improvements, and I need to dig into that to adapt the Quickwit Grafana plugin. I don't have a clear opinion on Kibana; one of my dreams is to build a better UI for log/trace search anyway, though that's not on the roadmap yet :)


I am impressed with this work, especially given our current use of Loki with Prometheus and Grafana.

A few questions come to mind. Firstly, is Quickwit compatible with any S3-compatible object storage, such as Cloudflare's R2? Are there particular considerations to keep in mind for this kind of setup? Secondly, do you see Quickwit being used for analytics, such as tracking daily visits or analyzing user retention?

Your insights on these would be greatly appreciated.


Quickwit is compatible with S3-compatible object storage, yes. I don't remember feedback for R2 specifically, but we have users running Quickwit on S3, GCS, Azure, MinIO, Garage, IBM, and all of the major Chinese clouds.
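
For an S3-compatible store like MinIO or R2, it's mostly a matter of pointing Quickwit at the endpoint and credentials. A rough sketch (exact variable names may differ between versions, so double-check the storage docs):

    # credentials and endpoint are placeholders
    export AWS_ACCESS_KEY_ID=minio-access-key
    export AWS_SECRET_ACCESS_KEY=minio-secret-key
    export QW_S3_ENDPOINT=http://minio:9000

    # quickwit.yaml
    default_index_root_uri: s3://quickwit-indexes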

> Secondly, do you see Quickwit being used for analytics, such as tracking daily visits or analyzing user retention?

Excellent question. Quickwit is very fast on Elasticsearch aggregations. We do not support the cardinality aggregation yet, but it is scheduled for version 0.8.
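
For example, a terms aggregation through the Elasticsearch-compatible endpoint looks something like this (index and field names are made up):

    curl -XPOST "http://127.0.0.1:7280/api/v1/_elastic/my-logs/_search" \
         -H 'Content-Type: application/json' -d '
    {
      "size": 0,
      "aggs": {
        "by_status": { "terms": { "field": "status_code" } }
      }
    }'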

Analyzing user retention and, generally speaking, running complex analytics will not be possible any time soon. Maybe next year?


Tangent question: what do you think about the vision of the future of storage that the folks at ScyllaDB have?

There is a great presentation from their recent conference [1]. I am asking this because I find it quite similar to what you are doing - decoupling storage from compute and utilising S3. I would really appreciate it if you could share your insights.

[1] starts at 30:45 https://m.youtube.com/watch?v=ZX7rA78BYS0


I had a look at this passage of the talk. Generally, S3 is a more obvious fit for OLAP, where inserts are very batchy and reads are large.

For OLTP, the equation is less obvious. I'd be scared of the costs associated with PUT/GET requests. (Pushing once every 30s or so equates to a few dollars per month.)

Since Scylla is based on an LSM-tree (I think), I would have expected the talk to be about saving SSTables on S3 but keeping the WAL on disk... But the slides seem to point to pushing the WAL to S3.


This is great, thanks for your input.


When open source providers say 'alternative' to a commercial solution, do they consider the serious engineering needed to scale such systems? I mean, I used SigNoz, and it's comparable to Datadog feature-wise but nowhere close in performance and scalability.


I must admit that 'alternative' is always a tricky word... Datadog, Elasticsearch, and Splunk are giant beasts, and the alternative makes sense only for a subset of features (and hopefully, we will successfully execute our 2024 roadmap to reduce the gap).

For Quickwit, our users proved to us that it scales up to petabytes, so we take that scale factor into account when we say "alternative". But... we don't have a dedicated metrics storage engine yet, so if you want to store metrics in Quickwit, it won't be efficient in the current version. It will come later this year.


hey - SigNoz maintainer here.

> it's comparable to Datadog feature-wise but nowhere close in performance and scalability

Would love to understand how you tested SigNoz and what issues you found with performance and scalability.


I have seen it ingest 500K events/s. How did you conclude that the perf is poor?


I'm surprised no one has mentioned that it's mostly programmed in Rust.


Next time don't mention it and we'll enjoy the peace and quiet :)


Elasticsearch was written in Java. No one cared. We still used it because it was good software.


Actually, we chose not to use it after talking to the developers about some of their approaches at the time. They had weird views on GC and made it impossible to tune their stack for lower latency. It was not really an option to operate ES for a larger enterprise with decent SREs.


Glad this is getting some love. This is seriously good software. Have you guys added support for generic substring search yet? I recall it was not supported as of a few months ago.


Not yet, only prefixes. Also, you could probably cook something up with an ngram tokenizer.
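
If you want to try the ngram route, the index config lets you declare a custom tokenizer and assign it to a text field. A sketch from memory (exact nesting and keys may differ, check the index config docs):

    doc_mapping:
      tokenizers:
        - name: ngram3
          type: ngram
          min_gram: 3
          max_gram: 3
      field_mappings:
        - name: body
          type: text
          tokenizer: ngram3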

Is it for a field with a high cardinality? If you tell us more about your use case, maybe we can find a workaround.


No, just curious. I understand how your indexing structure, based on SSTables, could make it challenging to support substring search in general. I think it's a tradeoff between fast querying and flexible functionality.


Congrats on the launch, we'll have to get you integrated with Rootly :). We can enable incident responders to fetch metrics while they respond to incidents in Slack!


How far away is the ES Query DSL compatibility?


I see that you persist the logs in cloud storage. Where are the indexes stored?


The most interesting part of Quickwit for me has been that because it stores the index in object stores, you can theoretically abstract away a lot of the horizontal scaling with things like MinIO or SeaweedFS that have S3-compatible APIs w/o having to go to the cloud.

Unfortunately, I think it's a large reason why it was made a design choice that indexed documents are immutable (IIRC), which doesn't work for my use case.


So yes, currently we only support infrequent deletes, mainly for GDPR reasons.

It's possible to add updates/deletes to Quickwit, but this is a lot of work, and for now, we have not prioritized this development.

Do you mind sharing your use case?


Oh, just run-of-the-mill full-text search of web crawls.

Some web pages update frequently, so I would need something that can handle that.

I do understand that you seem to be targeting the log use-case of Elasticsearch more so than the "Apache Solr" use-case.

In passing, I would be curious what about the GDPR makes infrequent deletes a design choice. My understanding of GDPR is that the "right to be forgotten" aspects would, if anything, require deletions to be inexpensive.


A web crawl can actually be an OK use case.

The idea then would be to "reindex" the world. It might seem ludicrous, but to give you an idea, indexing CommonCrawl takes about a day with 8 vCPUs.


I would assume that users requesting to be forgotten is a fairly infrequent occurrence compared to just about any other operation.


True, but it can also require massive deletes/updates.

I suppose there are ways to structure things to minimize this, though.


We store the indexes on object storage. We worked on the index data structures to optimize the query path.

For use cases where you have a lot of QPS, we recommend using Garage or MinIO, or the new feature (in 0.7) to cache index data on local disks.

I wrote a blog post that explains how we do that: https://quickwit.io/blog/quickwit-101


Why don't you use ClickHouse or some columnar storage for the logs? You get indexing and querying for free.


It is an indexing vs. search trade-off.

Let's just consider IO, as it is the main effect here. With a columnar store, you have to read all of the fields targeted by your query.

If this is logs, let's assume:

- 200B per line of logs

- 100TB of logs = 500 billion lines of logs

- 30 days of retention.

- a body text field taking 70% of your data.

- highly compressible (10x compression ratio). Note this will come with a higher CPU cost, but let's focus on an IO lower bound.

If you search the body field over those 100TB of data, regardless of the query, you will have to read (and decompress, but let's not talk about CPU) 7TB worth of data.

Now with an inverted index? You will have to read the posting lists only. The posting lists are delta-encoded and bitpacked.
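
To make "delta-encoded and bitpacked" concrete, here is a toy posting list with made-up doc ids:

    doc ids   : 3, 7, 8, 12, 13
    deltas    : 3, 4, 1, 4, 1      (gap from the previous doc id)
    bitpacked : the largest delta is 4, so 3 bits per entry instead of a 32/64-bit doc id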

Assuming the probability of presence of a given term in a log line is p, the worst case is a token that appears in 99% of the documents (p = 0.99).

In that case, you will have to read 2.06 bits per document. That's about 128GB for that worst-case posting list.

If you are looking for a single keyword, in the worst possible case, you will have to read 54 times less data than with the columnar solution.
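
Spelling out the arithmetic with the numbers above:

    columnar scan : 100TB x 70% / 10 (compression) ≈ 7TB read
    inverted index: 500 billion docs x 2.06 bits ≈ 1.03e12 bits ≈ 129GB read
    ratio         : 7TB / 129GB ≈ 54x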

In practice, users search for several keywords, but the terms are also considerably less pathological than the example I just gave. Overall, you will typically end up reading 20 to 100 times less data than with the columnar (grep-like) solution.

I left CPU aside, but actual search engines are also much more CPU efficient.

But I said there was a trade-off... where is it? Well, you have to pay a much higher cost at indexing time.

Some search engine implementations make it seem like indexing is more expensive than it should be. Quickwit/tantivy are especially efficient there. With a 4-vCPU VM, you can expect to index at 2TB/day. So in the example above, you would have to dedicate 8 vCPUs to indexing, which is perfectly reasonable.
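
Again with the numbers above:

    100TB over 30 days ≈ 3.3TB/day to index
    at 2TB/day per 4 vCPUs, that is 3.3 / 2 x 4 ≈ 6.7, rounded up to 8 vCPUs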

BUT if your retention is much, much shorter (a few days), indexing might not be worth it.

If your volume of data is small too, you probably do not need to care at all about efficiency.


So Quickwit is primarily a search engine and thus relies on an inverted index. We also implemented our own schemaless columnar storage, optimized for object storage.

The inverted index and columnar storage are part of tantivy [0], which is the fastest OSS search library out there (except for the academic project pisa) [1]. We maintain it, and we decided to build the distributed engine on top of it.

[0] tantivy github repo: https://github.com/quickwit-oss/tantivy

[1] tantivy bench https://tantivy-search.github.io/bench/


How does it compare to SigNoz, which is also open source?


I know that SigNoz uses ClickHouse and focuses more on observability UX.

Quickwit is its own data engine.


SigNoz maintainer here.

We also have traces, metrics, and logs in a single application, which makes correlation across them much easier. From what I can understand from the Quickwit website, they use Grafana and Jaeger for the UI.

Here's our GitHub repo if you want to check it out: https://github.com/signoz/signoz


That license is not FOSS. Users beware of the subscription costs to come.


Quickwit is under AGPLv3. Are you saying that AGPLv3 is not FOSS?


I find presenting this as an open source alternative to commercial solutions a little disingenuous when any commercial use of it also requires a paid license. Like in many other cases, it seems like the AGPL is functioning more as a trial license.


Well, I would say it depends. We have many companies using the AGPL version without buying a license. We also know that some companies have strict policies and will forbid using AGPL software unless they take a commercial license. We're happy with both kinds of users.

I like the example of Grafana with all their AGPL projects (Grafana, Loki, Tempo, ...). There are a LOT of companies using Grafana with the AGPL version.


The GNU AGPL is considered a Free Software licence by the OSI, FSF, and the Debian project. That's good enough for me.


You can use AGPL software commercially. The limitation is that you will have to open-source any patch you make to it.


You can dual-license software and have a non-AGPL license for customers. AGPL is a must nowadays to prevent Amazon and other giant companies like it from "stealing" your project and offering the same thing 2-3x cheaper while you go out of business.


IANAL, but AFAIK one is required to release patches to the product itself only if product functionality is exposed to customers. If it is an internal tool, patches can, I believe, stay internal. But if you want to offer, for example, a hosted solution based on this product, then you are required to release your modifications.


I find AGPL perfect for this use case, and my org (>100k hosts) can use it without any problem, as we are using it for internal purposes and not, say, rebranding it and offering it as a part of our own product.



