
Apache Druid vs. Time-Series Databases - wochiquan
https://imply.io/post/apache-druid-vs-time-series-databases
======
Nihilartikel
I built out a Druid backend for interactive graphing & aggregation of web
traffic and application security metrics a few years back. Users could choose
arbitrary filters, aggregations, and time slicing. This was a second system
replacing a Spark cluster running over timeseries events in Cassandra, which
wasn't really practically scalable. Tuning and debugging the Spark queries and
Cassandra performance was an endless time sink.

Druid worked really well for almost all use cases, reliably getting sub-second
results against many billions of records even on a pretty modest deployment.
Being able to use arbitrary JavaScript functions in queries was fantastic: we
could do things like filtering on IP subnets or case-insensitive
prefix/suffix matching as needed.
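For illustration, a native Druid JavaScript filter looks roughly like this (the dimension name is invented, and note that newer Druid releases ship with JavaScript functions disabled by default):

```json
{
  "type": "javascript",
  "dimension": "client_ip",
  "function": "function(ip) { return ip.indexOf('10.12.') === 0; }"
}
```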

The Docker setup that Druid ships with is deceptively simple; standing up a
production installation took real effort. My thoughts are:

\- Build templating and code generation for the many config files for each
component early on so you can edit constants in a single place and have all of
the sundry config files update to reflect them, and also manage per-host
overrides in a sane, version controlled, way.

\- Druid will use as much RAM as you can throw at it, but in a pinch, reading
directly from fast NVMe storage is pretty good.

\- If you have realtime data ingestion, you will also have to build tooling to
re-ingest older data that has changed or needs to be amended. This will end
up looking like a 'lambda architecture'.
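On the config-templating point, a minimal sketch of the idea in Python (property names are loosely modeled on Druid's `runtime.properties` style, and all hosts and values are invented):

```python
from string import Template

# Shared constants defined once; per-host overrides layered on top,
# rendered into each component's config file from a single place.
COMMON = {"zk_hosts": "zk1:2181,zk2:2181", "heap_gb": 8}
OVERRIDES = {"historical-01": {"heap_gb": 64}, "broker-01": {"heap_gb": 16}}

TEMPLATE = Template(
    "druid.zk.service.host=$zk_hosts\n"
    "# heap set in jvm.config: -Xmx${heap_gb}g\n"
)

def render(host: str) -> str:
    """Merge shared constants with any per-host overrides and render."""
    return TEMPLATE.substitute({**COMMON, **OVERRIDES.get(host, {})})

print(render("historical-01"))
```

Keeping the merged dict and the templates in version control gives you the single-place edits and sane per-host overrides mentioned above.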

~~~
battery_cowboy
I'm not a data guy, can you explain this part a bit more if you have time?

> If you have realtime data ingestion, you will also have to build tooling to
> re-ingest older data that has changed or needs to be amended. This will end
> up looking like a 'lambda architecture'

~~~
ignoramous
Lambda architecture for data processing, as popularized by Nathan Marz et al
[0], has two components, a Batch layer and a Stream layer. At a high
level, _Batch_ trades freshness for quality whilst _Stream_ optimises for
freshness at the expense of quality [1].

I believe what GP means by _Lambda_ is that you'd need a system that _batch_-
processes the data to be amended or changed (reprocessing older data) but
_stream_-processes whatever is required in real time [2].
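A toy sketch of that batch/stream split (all names illustrative, not any real API):

```python
from collections import defaultdict

# Lambda architecture in miniature: the stream layer keeps fast,
# possibly-incomplete counts per hour; the batch layer periodically
# reprocesses the raw event log (including late corrections) and its
# output supersedes the stream layer's for those hours.
batch_view = {}                 # authoritative counts for closed hours
stream_view = defaultdict(int)  # fresh counts for recent hours

def ingest(hour: int, n: int = 1) -> None:
    """Real-time path: cheap and immediate, but may miss corrections."""
    stream_view[hour] += n

def batch_recompute(event_log, up_to_hour: int) -> None:
    """Batch path: recompute closed hours from the raw event log."""
    for hour in range(up_to_hour + 1):
        batch_view[hour] = sum(n for h, n in event_log if h == hour)
        stream_view.pop(hour, None)  # batch result wins for this hour

def query(hour: int) -> int:
    """Serving layer: merge the batch (old) and stream (recent) views."""
    return batch_view.get(hour, 0) + stream_view.get(hour, 0)
```

The operational cost GP alludes to is that you end up maintaining both paths, plus the tooling that decides when the batch result replaces the stream result.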

An alternative is the Kappa architecture proposed initially by Jay Kreps
[3][4], co-creator of Apache Kafka.

\---

[0]
[https://www.amazon.com/dp/1617290343](https://www.amazon.com/dp/1617290343)

[1]
[https://en.wikipedia.org/wiki/Lambda_architecture](https://en.wikipedia.org/wiki/Lambda_architecture)

[2] [https://speakerdeck.com/druidio/real-time-analytics-with-
ope...](https://speakerdeck.com/druidio/real-time-analytics-with-open-source-
technologies-1?slide=41)

[3] [https://engineering.linkedin.com/distributed-systems/log-
wha...](https://engineering.linkedin.com/distributed-systems/log-what-every-
software-engineer-should-know-about-real-time-datas-unifying)

[4] [https://dataintensive.net/](https://dataintensive.net/)

~~~
sologoub
The sources are good and thorough, but very long. Here's an OK summary of
the Kappa proposal: [https://milinda.pathirage.org/kappa-
architecture.com/](https://milinda.pathirage.org/kappa-architecture.com/)

In theory this sounds great, but you have to account for processing capacity.

While compute is getting cheaper, one of the key reasons streaming in lambda
sacrifices quality for throughput is compute capacity (as well as timing). If
you have to feed already-stored data through the same streaming pipe, you
either have to have a lot of excess capacity, be willing to pay for that
additional burst, or accept latency in your results (assuming you can keep up
with your incoming workload and not lose data). There is no free lunch.

------
quietgolfer
I looked into Apache Druid for my company's event aggregation and serving. We
decided to go with Apache Pinot instead. A major reason was serving latency.
Most of the docs I saw that discussed Druid's latency (and its latency goals)
made it seem like 500ms was acceptable query latency. Pinot seemed more
focused on query latency, and 50ms was closer to my team's latency goal. I
ran some basic latency tests and Pinot was faster (2 million and 22 million
events, no real optimizations). I think Pinot's star-tree index would
perform great for my team's use case.

Druid's intro flow was awesome. The tutorials worked very well. I hit hiccups
with Pinot's intro flow. The Pinot team has been very responsive on Slack to
issues.

Here are some notes:

[https://docs.google.com/document/d/1GiB8zoiJ4Qs10A1LdIr6lJGW...](https://docs.google.com/document/d/1GiB8zoiJ4Qs10A1LdIr6lJGWTfk-0ElU8aJbejomstY/edit)

~~~
ychb
Similar to @quietgolfer, we also have strict requirements on serving latency,
and we benefited from Pinot's star-tree index. We had a table with more than 1
billion events on which we had to do DISTINCTCOUNT and SUM with GROUP BY.
Initially, without optimization, queries timed out (after 15s). After the
star-tree index was added, the query returned in 500ms.
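For reference, a star-tree index is declared in the table's index config. This is a rough sketch only: the column names are invented, and the exact aggregation function names should be checked against the Pinot docs (inside a star-tree, distinct counts are typically pre-aggregated approximately, e.g. via HLL, rather than exactly):

```json
"tableIndexConfig": {
  "starTreeIndexConfigs": [{
    "dimensionsSplitOrder": ["country", "deviceType"],
    "skipStarNodeCreationForDimensions": [],
    "functionColumnPairs": ["SUM__revenue", "DISTINCT_COUNT_HLL__userId"],
    "maxLeafRecords": 10000
  }]
}
```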

Based on our earlier benchmark, Pinot's query latency without star-tree was
already slightly better than Druid's:
[https://image.slidesharecdn.com/lnkdmeetupslides0426-1705021...](https://image.slidesharecdn.com/lnkdmeetupslides0426-170502173242/95/pinot-
near-realtime-analytics-uber-18-638.jpg?cb=1493746410). Though that benchmark
was done years ago, the result seems consistent with the recent test run by
@quietgolfer.

------
polskibus
Why choose Druid over clickhouse?

~~~
manigandham
Most people shouldn't. ClickHouse and other column-store data warehouses
(Redshift, BigQuery, etc.) are very fast and have all the features needed to
handle time-series and other data.

Druid is a good fit if you: (1) make use of native integrations like Kafka,
(2) need every field indexed for fast seeking to a few rows, (3) can use the
JSON API to make up for the still-in-beta SQL interface, (4) don't need every
single event/row, since they are pre-aggregated, (5) always have a time column
to partition by, (6) want to use the native S3 tiering for older data, and
(7) don't need joins or complex analysis.
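For context on the JSON API mentioned in (3), a native Druid query is a JSON document POSTed to the broker's `/druid/v2` endpoint. The datasource and field names here are invented for illustration:

```json
{
  "queryType": "timeseries",
  "dataSource": "web_events",
  "intervals": ["2020-04-01/2020-04-08"],
  "granularity": "hour",
  "filter": {"type": "selector", "dimension": "country", "value": "US"},
  "aggregations": [{"type": "longSum", "name": "hits", "fieldName": "count"}]
}
```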

Imply's distribution is better than core Druid, but it's still more
operationally complex than ClickHouse and the alternatives.

~~~
DevKoala
You are right about ClickHouse, but other data warehouses are not optimized
for the same use case as Druid and ClickHouse: OLAP.

For example, Redshift and BigQuery cannot be put behind a user-facing backend.
BigQuery has a default limit of 50 concurrent queries; if that's your user
limit, perfect. Redshift takes seconds for queries that Druid and ClickHouse
can answer in milliseconds.

~~~
manigandham
All data warehouses are designed for OLAP; that's their purpose. OLAP doesn't
require low latency, though.

Redshift is an always-running cluster of scale-out distributed Postgres,
forked by AWS, so it can and does return results in milliseconds, very similar
to ClickHouse although still not as advanced in performance techniques.

BigQuery is a completely managed service with a far greater scale-out
architecture, designed for throughput (petabytes in seconds) rather than
latency, although it has real-time streaming, BI Engine (an in-memory cache),
and materialized views, so you can get pretty close today.

Snowflake is another option; it runs on top of the major clouds, using
instances to access object storage, and also has low latency when your
cluster is running.

~~~
DevKoala
Not all data warehouses are designed for OLAP. As the performance of these
systems increases, it is possible to satisfy OLAP use cases with some
warehouses such as BQ, but they are still not ideal once your number of
concurrent queries scales up while your response-time requirements remain
constant. The BQ team certainly doesn't consider their product an OLAP
solution, and neither does the Redshift team. AWS is actually working on
their own OLAP solution, Timestream; that is how they pitch it, as their
OLAP solution.

BigQuery's "petabyte aggregations in seconds" is a false claim. I just
launched a query to count the number of records in a 1.4 PB dataset and it
took over a minute just to schedule the query, and I have 1000+ compute units
available too. I have been able to perform a similar query in seconds on top
of Druid; the cluster had a run rate of $25k/month at the time, while I
believe we pay $40k+/month just on our BQ deal. Maybe under ideal conditions
BigQuery can run such a query, but I have never seen it do such things.

I have experience with Snowflake; it is great if you are in AWS, but I
wouldn't choose it over BigQuery. I feel that Snowflake's strength is its
data-sharing protocol, which allows us to tap into the datasets of partners
without major data transfer costs. We use it for that reason only.

~~~
manigandham
You're using a strange definition of OLAP here; it's really not tied to
performance or concurrency. It only means you can send an ad-hoc query and
get results back directly in response, rather than waiting on pre-computed or
batched processing happening somewhere else.

Redshift and BigQuery are both OLAP products, and their documentation and
whitepapers state the same. The performance profiles of both are very
different, but that's due to the architecture. Redshift is an always-running
cluster of EC2 instances running their forked scale-out Postgres. BigQuery is
an entirely different engine based on Dremel, which focuses on
parallelization as much as possible, along with Google's proprietary
datacenters that have petabits of internal bandwidth to support it. AWS
Timestream is an entirely different product aimed specifically at time-series
data.

The only reason Druid is faster is that it pre-aggregates and indexes data.
You can aggregate tables yourself in Redshift and use the right primary keys
to get most of the same performance. BigQuery just launched materialized views
and is working on scheduling improvements and in-memory caching as well.
Snowflake now runs on Azure and GCP and can do sub-second responses when the
warehouse is running and the data is cached or the tables are small.

~~~
DevKoala
I wasn't able to achieve the same performance with Redshift aggregates; I
tried that first, before I decided to migrate from Redshift to Druid back in
2014. We deal with dozens of dimensions per event, and no combination of
distribution keys in Redshift was able to give us the same performance over
arbitrary scans + aggregations.

Druid is not only fast because it pre-aggregates; its memory structure is
designed for scans over hundreds of dimensions.

Materializing views in BigQuery is just one DAG task. Unless you don't have
something like Airflow in your stack, I don't see how it is worth mentioning.
We are talking about denormalized, time-series data.

I am speaking from experience with each one of these products. Perhaps I did
it all wrong, but we certainly achieved the objectives we were after.

------
yumraj
Just curious what are people using Druid, Clickhouse, Pinot... for?

would love to learn by knowing about the use-cases where I should perhaps
consider such technologies..

~~~
hodgesrm
Some common ClickHouse use cases and verticals are shown below: basically,
any case where you have relatively structured data, very large fact tables,
and a need for low-latency responses. Names of companies that have given
public talks are included.

* Analyzing netflow logs (network management, numerous users)

* Content delivery (Mux.com)

* Web analytics (ContentSquare, CloudFlare)

* Generating optimized parameters for real time ad bidding (various, not too many recent talks on this)

* Log management and system observability (surprisingly common across industries: Sentry, Cloudflare, Uber)

* Valuing assets from market data, e.g. tick data (financial services, not many companies will discuss publicly)

Druid is used for some of these. I don't know much about Pinot, so I can't
comment there.

------
rb808
I stumbled across Druid (with Superset) and it looks great but has a really
low profile, even though it is in a hot area. Does anyone actually use it?

~~~
switzer
I used it a few years ago at my former company. It effortlessly ingested 10K
rows of data per second, and we configured Druid with over 50 dimensions and
about 20 different metrics. In that configuration, Druid was able to respond
to most queries within a second. I am not sure how it works, but it is an
amazing product. We are evaluating it now for use at my current company.

~~~
shaklee3
To be fair, 10k rows per second is extremely slow. Clickhouse can do millions
per second on a single server in our tests.

------
jonpo
Is it really better, though? I feel like we need more data. What are the
tradeoffs? Seems like a sales article.

~~~
adequateness
It is an article published on a blog of a company talking about what said
company does.

------
starpilot
kdb/Shakti

~~~
quod_2058
Or QuestDB, if you want that sort of performance but prefer SQL.

~~~
starpilot
Fast as kdb? You have a single fact to back that up?

~~~
quod_2058
Saw this last week. The examples are limited but clearly easy to reproduce. I
saw a discussion on Reddit where it seems they plan to get even faster using
more techniques like prefetch.
[https://news.ycombinator.com/item?id=22803504](https://news.ycombinator.com/item?id=22803504)

~~~
starpilot
mother of god.

