Druid: A Real-Time Analytical Data Store (micahlerner.com)
73 points by mlerner on July 28, 2022 | 28 comments



My sense from the few projects I've seen attempt to use Druid is that there is quite a lot of infrastructure overhead / DevOps support required to manage a cluster at scale, and that fairly complex ingestion pipelines are required to load the data in the right format.

Anecdotally, I've heard that ClickHouse is easier to deploy from this perspective, with similar performance, but I would love to get others' views / experience with these and similar data stores.


We found the opposite. Setting up Druid clusters is the easiest compared to its competitors, ClickHouse and Pinot.

Especially since ingestion goes straight to S3. We don’t really worry about backups (just deal with PG backups).

Just make sure your ZK is happy and all will be well.

The hard part about Druid is tuning:

- ingestion: spec definition, compaction, sharding strategy, RAM consumption, etc. (rough spec sketch below)

- and query performance: RAM consumption, number of threads, timeouts, etc.
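
To give a flavor of that tuning, here is a rough, stripped-down sketch of a native batch ingestion spec; the datasource name, dimensions, bucket, and numbers are made up, and exact fields can vary by Druid version:

    {
      "type": "index_parallel",
      "spec": {
        "dataSchema": {
          "dataSource": "events",
          "timestampSpec": { "column": "ts", "format": "iso" },
          "dimensionsSpec": { "dimensions": ["country", "device"] },
          "granularitySpec": {
            "segmentGranularity": "DAY",
            "queryGranularity": "HOUR",
            "rollup": true
          }
        },
        "ioConfig": {
          "type": "index_parallel",
          "inputSource": { "type": "s3", "uris": ["s3://my-bucket/events/2022-07-28.json"] },
          "inputFormat": { "type": "json" }
        },
        "tuningConfig": {
          "type": "index_parallel",
          "partitionsSpec": { "type": "hashed", "numShards": 4 },
          "maxRowsInMemory": 100000
        }
      }
    }

Getting segmentGranularity, the partitioning strategy, and the in-memory limits right for your data volume is where most of the tuning time goes.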


How is the documentation for this deployment and tuning? Last time I checked, I had the impression that anything about the dozen different node types wasn't very clear, not to mention that details about ingestion were all over the place.

And it was hard to find examples of configuration and ingestion beyond the basic tutorials.


Docs definitely have room for improvement.

Architecturally, it is easier to visualize it as two big groups:

- query serving: coordinator, historical, broker

- ingestion: overlord, middlemanager

The router unifies all of Druid's APIs.

I would start with the Helm chart to get a basic idea of the tunings.


> Just make sure your ZK is happy and all will be well.

That sounds like the opposite of easy to setup (and maintain).


We had Druid in production and this was our main weak link. It's really hard to find people who know how to operate ZK well in production.


Strange - we use Druid at scale at work, and it doesn't really require much maintenance. In fact, aside from some incidents caused by our own misconfiguration, it buzzes away just fine. Runs on Kubernetes. The most annoying thing was tuning segment compaction.
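
On that compaction point: auto-compaction is configured per datasource and submitted to the coordinator, e.g. via POST /druid/coordinator/v1/config/compaction. A rough sketch, with an invented datasource name and numbers, and fields that may differ between Druid versions:

    {
      "dataSource": "events",
      "skipOffsetFromLatest": "P1D",
      "tuningConfig": {
        "partitionsSpec": { "type": "dynamic", "maxRowsPerSegment": 5000000 }
      }
    }

skipOffsetFromLatest keeps compaction away from the freshest intervals that may still be receiving data.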

It's actually quite nice: since everything is stored on GCS/S3, it is mostly self-healing, and we can treat the historicals as cattle and not pets.

We also run clickhouse, and unfortunately the above is not true - at least in our setup.


ClickHouse is great. Been using it without any major issues.

We started experiencing some data-duplication issues with ClickHouse when we moved our table to a sharded + replicated setup.

OPTIMIZE with DEDUPLICATE helped us a lot, and we can just run it on a partition instead of the full table.

https://clickhouse.com/docs/en/sql-reference/statements/opti...
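
For anyone curious, the statement looks roughly like this (table name and partition value are placeholders; adjust for your partition key):

    -- deduplicate a single partition instead of the whole table
    OPTIMIZE TABLE events PARTITION '2022-07-01' FINAL DEDUPLICATE;

    -- or only compare a subset of columns when deciding what counts as a duplicate
    OPTIMIZE TABLE events PARTITION '2022-07-01' FINAL DEDUPLICATE BY id, ts;

Worth noting that it rewrites the affected parts, so it is not free on big partitions.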

ClickHouse is a very powerful system, but it's not a set-up-and-forget kind of thing.


I agree. ClickHouse is the best if you are looking for an analytics database.


This is one of the reasons why we chose ClickHouse as our backend at PostHog. We wanted something that was relatively simple to operate for our users who deploy PostHog on-prem. We've been super happy with it so far. Still not turnkey in many ways, but it's been pretty great.


I run a ClickHouse cluster. I’ve heard Druid is more difficult, but ClickHouse isn’t exactly a piece of cake. It’s great for a while, but sometimes you can get bitten by weird states. Still, compared to everything I’ve used, for the use case where it excels, it really excels.


ClickHouse is difficult to operate and does not support joins well. I used https://github.com/apache/doris to replace our ClickHouse and Druid workloads.


Clickhouse benefits from the ability to get started as a single binary (and even includes a clickhouse-local binary to do ad-hoc analysis on CSVs on your laptop). There’s only one node type, as well. It’s simpler and easier in that sense.
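
As a tiny illustration of the clickhouse-local point (the file and column names here are made up):

    clickhouse-local --query "
      SELECT country, count() AS hits
      FROM file('events.csv', CSVWithNames)
      GROUP BY country
      ORDER BY hits DESC
      LIMIT 10"

No server, no schema setup; it just reads the CSV off disk.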

Running it at scale is different. It includes everything you need, and it’s not horrible of course - but there are certainly a lot of sharp edges to be mindful of.


Druid is horizontally scalable by itself if you have access to something like S3 or any compatible object storage. Druid's core design is remarkable, designed from the ground up to optimally leverage and work harmoniously with cloud tech. Once it is set up appropriately for your use case, it's trivial to stamp out over and over with Terraform.
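
For reference, pointing Druid at S3 deep storage is mostly a handful of common runtime properties, roughly like the excerpt below (bucket and prefixes are placeholders; the exact set depends on version and how you handle credentials):

    # common.runtime.properties (excerpt)
    druid.extensions.loadList=["druid-s3-extensions"]
    druid.storage.type=s3
    druid.storage.bucket=my-druid-deep-storage
    druid.storage.baseKey=segments
    druid.indexer.logs.type=s3
    druid.indexer.logs.s3Bucket=my-druid-deep-storage
    druid.indexer.logs.s3Prefix=indexing-logs

After that, historicals and tasks pull segments from the bucket, which is what makes replacing nodes cheap.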


Several years back, I was running it on a single server along with Kafka (I was asked to keep everything related to analytics on a single huge server), and it started with a fight between the two over ZooKeeper. While it worked quite well thereafter, keeping it up was a battle.

Maybe the situation is better with clusters, and with Kafka moving away from ZooKeeper.


I feel like it shares a lot of the complexity with the rest of the Hadoop-adjacent products. If your company already manages standard Hadoop infra, it's probably not too different; otherwise it seems like quite a bumpy road.


Agreed. We use Druid, it's quite successful, but the Druid hosting situation up to this point has been pretty lousy.


What other options are there of this type? ClickHouse — anything else?


ClickHouse, Trino, Kylin, BigQuery, Snowflake; plenty of competitors.

Having worked in the space for a few years: Druid is very fast when you know exactly what you're doing. Until you reach that point, you're going to have a bad time. And if you try to run it yourself on Kubernetes, you're going to have a really, really bad time. Druid is amazingly fast, but no development effort is going into making it easier to run.


This has been exactly my experience. It requires that you know exactly what you are doing and what you are using the system for.

Surprised to see comments on this thread saying they are running Druid on Kubernetes with no issues at all.


We evaluated Pulsar plus Trino/Presto a while ago, which seems to be similar in spirit. Pulsar promised to be a slick follow-on to Kafka with tiered storage. We found some terrible write/read latency problems when we tried to force the spillover to S3. We gave up on it.



Why are you posting this comment so many times?


Maybe Doris is good (http://github.com/apache/doris).

Join support is bad in ClickHouse. For Apache Doris, performance and joins are better than ClickHouse.


I think ClickHouse is the main one, but there are others like Apache Pinot. Snowflake and co. are different, so not really in the same class of products.


I think Apache Doris is better for real-time data analysis: https://github.com/apache/doris https://doris.apache.org/


What is the difference from http://github.com/apache/doris ?


database track



