My sense from the few projects I've seen attempt to use Druid is that there is quite a lot of infrastructure overhead / DevOps support required to manage a cluster at scale, and that fairly complex ingestion pipelines are required to load the data in the right format.
Anecdotally, I've heard that ClickHouse is easier to deploy from this perspective with similar performance, but would love to get others views / experience with these and similar data stores.
How is the documentation for deployment and tuning? Last time I checked, I had the impression that anything about the dozen or so different node types wasn't very clear, not to mention that the details about ingestion were scattered all over the place.
And it was hard to find examples of configuration or ingestion beyond the basic tutorials.
Strange - we use Druid at scale at work, and it doesn't really require much maintenance. Aside from some incidents caused by our own misconfiguration, it buzzes away just fine. Runs on Kubernetes. The most annoying thing was tuning segment compaction (rough config sketch below).
It's actually quite nice - since everything is stored on GCS/S3, it is mostly self-healing, and we can treat the historicals as cattle rather than pets.
We also run clickhouse, and unfortunately the above is not true - at least in our setup.
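On the segment compaction point: a minimal sketch of what auto-compaction tuning looks like, assuming the Coordinator API and field names as I remember them from the Druid docs - the datasource name, offset and row count here are made up, so verify against your Druid version.

    # Hypothetical example: submit an auto-compaction config to the Coordinator.
    # "events", the offset and maxRowsPerSegment are placeholders.
    curl -X POST http://coordinator:8081/druid/coordinator/v1/config/compaction \
      -H 'Content-Type: application/json' \
      -d '{
            "dataSource": "events",
            "skipOffsetFromLatest": "PT6H",
            "tuningConfig": {
              "partitionsSpec": { "type": "dynamic", "maxRowsPerSegment": 5000000 }
            }
          }'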
This is one of the reasons why we chose ClickHouse as our backend at PostHog. We wanted something that was relatively simple to operate for our users who deploy PostHog on-prem. We've been super happy with it so far. Still not turn key in many ways, but it's been pretty great.
I run a ClickHouse cluster. I’ve heard Druid is more difficult, but ClickHouse isn’t exactly a piece of cake. It’s great for a while, but sometimes you can get bitten by weird states. Still, compared to everything I’ve used, for the use case where it excels, it really excels.
Clickhouse benefits from the ability to get started as a single binary (and even includes a clickhouse-local binary to do ad-hoc analysis on CSVs on your laptop; see the sketch below). There's only one node type, as well. It's simpler and easier in that sense.
Running it at scale is different. It includes everything you need, and it’s not horrible of course - but there are certainly a lot of sharp edges to be mindful of.
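On the clickhouse-local point: a minimal sketch, assuming a reasonably recent build that can infer the schema from a CSV header - the file name and columns are made up.

    # Hypothetical CSV; recent versions infer the schema from CSVWithNames,
    # on older builds you would pass an explicit --structure instead.
    clickhouse-local --query "
      SELECT user_id, count() AS events, avg(duration_ms) AS avg_duration
      FROM file('events.csv', CSVWithNames)
      GROUP BY user_id
      ORDER BY events DESC
      LIMIT 10"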
Druid is horizontally scalable by itself if you have access to something like S3 or any compatible object storage. Druid's core design is remarkable, designed from the ground up to optimally leverage and work harmoniously with cloud tech. Once it is set up appropriately for your use case, it's trivial to stamp out over and over with Terraform.
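For context, pointing Druid at S3 for deep storage mostly comes down to a handful of properties in common.runtime.properties. A sketch from memory of the docs - the bucket name and prefixes are placeholders:

    # Hypothetical common.runtime.properties excerpt (bucket/prefixes are made up)
    druid.extensions.loadList=["druid-s3-extensions"]
    druid.storage.type=s3
    druid.storage.bucket=my-druid-deep-storage
    druid.storage.baseKey=druid/segments
    druid.indexer.logs.type=s3
    druid.indexer.logs.s3Bucket=my-druid-deep-storage
    druid.indexer.logs.s3Prefix=druid/indexing-logs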
Several years back, I was running it on a single server along with Kafka (I was asked to keep everything related to analytics on a single huge server), and it started with a fight over ZooKeeper between the two. While it worked quite well thereafter, keeping it up was a battle.
Maybe the situation is better now with clusters, and with Kafka moving away from ZooKeeper.
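For anyone hitting the same thing today: newer Kafka can run in KRaft mode with no ZooKeeper at all, which removes that particular fight. A minimal single-node sketch - ports, paths and IDs are placeholders, so check the quickstart for your Kafka version:

    # Hypothetical server.properties for a combined broker+controller node (KRaft)
    process.roles=broker,controller
    node.id=1
    controller.quorum.voters=1@localhost:9093
    listeners=PLAINTEXT://:9092,CONTROLLER://:9093
    inter.broker.listener.name=PLAINTEXT
    controller.listener.names=CONTROLLER
    advertised.listeners=PLAINTEXT://localhost:9092
    log.dirs=/var/lib/kafka/kraft-logs

    # Format the storage directory once before first start
    kafka-storage.sh format -t "$(kafka-storage.sh random-uuid)" -c server.properties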
I feel like it shares a lot of complexity with the rest of the Hadoop-adjacent products. If your company already manages standard Hadoop infra, it's probably not too different; otherwise it seems like quite a bumpy road.
ClickHouse, Trino, Kylin, BigQuery, Snowflake - plenty of competitors.
Having worked in this space for a few years: Druid is very fast when you know exactly what you're doing. Until you reach that point, you're going to have a bad time. And if you try to run it yourself on Kubernetes, you're going to have a really, really bad time. Druid is amazingly fast, but no development effort is going into making it easier to run.
We evaluated Pulsar plus Trino/Presto a while ago, which seems similar in spirit. Pulsar promised to be a slick follow-on to Kafka with tiered storage. We found some terrible write/read latency problems when we tried to force the spillover to S3, and we gave up on it.
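In case it helps anyone going down the same path, forcing the S3 spillover is driven by Pulsar's tiered-storage offload settings, roughly along these lines - the property names are from memory of the tiered storage docs, and the bucket, namespace and threshold are made up, so verify before relying on this:

    # Hypothetical broker.conf excerpt for S3 tiered storage (values are placeholders)
    managedLedgerOffloadDriver=aws-s3
    s3ManagedLedgerOffloadBucket=my-pulsar-offload
    s3ManagedLedgerOffloadRegion=us-east-1

    # Offload backlog beyond ~10G per topic in this (made-up) namespace
    pulsar-admin namespaces set-offload-threshold --size 10G my-tenant/my-namespace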