
Why Vertica Customers Adopt Apache Druid for Real-Time Analytics - wochiquan
https://imply.io/post/compare-apache-druid-to-vertica
======
gopalv
Apache Druid is a pretty amazing tool, with one assumption: your data has an
event timestamp as a crucial part of ingest, and it receives no updates at
all.

My run-ins with Vertica for BI/PM metrics data are almost a decade old, but it
is a bit more powerful in the way it does projections and distributions, for
instance.

The most common queries Vertica got hit with were unique-user workloads
involving intersections: a single table was ingested, but with three
projections - one partitioned by user, one by (user, property), and one by
(user, property, date).

The biggest dimension table was the A/B experiment-id allocation list, which
was duplicated on every single host.

A better storage model for this would be something like a Replex [1].

Druid can be used for the same sort of workload at high scale (i.e., millions
of users), where a best-effort distinct count is as good as the real thing,
but much faster.
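Best-effort distinct counts like this are usually backed by HyperLogLog-style sketches. Here is a minimal pure-Python sketch of the idea - illustrative only, not Druid's actual implementation (Druid uses HLL-based aggregators internally):

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog: estimates distinct count in O(2^p) memory."""

    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p                    # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash: first p bits pick a register, the rest supply the rank.
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1  # leading-zero run + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        z = sum(2.0 ** -r for r in self.registers)
        estimate = alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # small-range correction: fall back to linear counting
            estimate = self.m * math.log(self.m / zeros)
        return int(estimate)
```

With p=14 (16,384 registers, ~16 KB), the standard error is around 0.8% - well inside the "pixel is bigger than the error bar" regime mentioned below.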

If I had to do this today, I would also use the BloomKFilter in Apache Druid
for the experiment membership queries, which would also work better for
approximate queries than anything built to generate exact results (and store
the dimension table in a slowly-changing-dimension store).
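For the membership side, a Bloom filter answers "is this user in experiment X?" with no false negatives and a tunable false-positive rate. A plain illustrative sketch follows - Druid's actual BloomKFilter differs in bit layout and hashing:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: set membership with false positives, no false negatives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive k independent positions from slices of one SHA-256 digest.
        digest = hashlib.sha256(str(item).encode()).digest()
        for i in range(self.num_hashes):
            chunk = int.from_bytes(digest[i * 4:(i + 1) * 4], "big")
            yield chunk % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

The experiment-allocation dimension table shrinks to one small filter per experiment, shippable to every query node.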

The real power of Druid is pushing segments to S3 and being able to rehydrate
off Kafka, so it can survive total local data loss without being very
expensive with EBS (i.e., by downloading segments to ephemeral SSDs), while
answering dashboard queries where a pixel is bigger than the error bar on
these approximations.

Plus, the immutability of the data means you can maintain a partial-results
cache at segment granularity rather than recomputing for every refresh of the
dashboard.
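The caching idea above can be sketched as follows - names and structure are hypothetical, not Druid's API. Because a sealed segment never changes, its partial result for a given query can be cached forever; only the mutable real-time segment is recomputed on each refresh:

```python
# Hypothetical sketch: per-segment partial-result cache over immutable segments.
class SegmentResultCache:
    def __init__(self):
        self._cache = {}  # (segment_id, query_key) -> cached partial result

    def query(self, segments, query_key, compute):
        """Merge per-segment partial results, computing only uncached ones."""
        partials = []
        for seg in segments:
            key = (seg["id"], query_key)
            if seg["immutable"] and key in self._cache:
                partials.append(self._cache[key])
            else:
                result = compute(seg)
                if seg["immutable"]:          # safe to cache indefinitely
                    self._cache[key] = result
                partials.append(result)
        return sum(partials)                  # merge step (sum for a count query)
```

On a dashboard refresh, only the real-time segment is touched; everything historical is served from cache.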

Picking up this problem today in a web-scale environment, I would pick Druid
for experiment data streams and define rollup aggregates ahead of time (over,
say, ClickHouse), but as things get more mutable and less time-ordered, other
tools like Apache Kudu look better at the storage layer.

[1] [https://blog.acolyer.org/2016/10/27/replex-a-scalable-highly-available-multi-index-data-store/](https://blog.acolyer.org/2016/10/27/replex-a-scalable-highly-available-multi-index-data-store/)

~~~
sa46
The Procella[1] paper took a different approach to experiments. They embedded
an experiment ID array in table rows and indexed the rows by experiment ID
with a postings list.

Replex looks really neat. I've only skimmed the Acolyer summary so far. What's
the difference between a replex and multiple projections of data with
different partitions and sort-orders used in C-store and Vertica?

[1]:
[http://www.vldb.org/pvldb/vol12/p2022-chattopadhyay.pdf](http://www.vldb.org/pvldb/vol12/p2022-chattopadhyay.pdf)

~~~
gopalv
> What's the difference between a replex and multiple projections of data with
> different partitions and sort-orders used in C-store and Vertica?

The three replicas kept for failure tolerance are reused, so the first three
ordering projections don't add storage cost to the system.

The paper doesn't mention it, but rebuild traffic is also better distributed:
the failure of a replica triggers a rebuild that draws from a wider set of
machines rather than a single one.

------
manigandham
Vertica lost its lead a long time ago. There are better columnstore data
warehouses like MemSQL and Clickhouse, and most people are just moving to the
cloud with Redshift, BigQuery and Snowflake.

Druid doesn't really compete with these systems; it's more of a complement
when you need low-latency queries against defined, fully-indexed fields along
with pre-aggregation (the source of its performance). It also has a lot more
operational complexity and only basic SQL support.

As of 2020, I don't see much of a use for Druid since columnstores already
support real-time updates and are adding indexing, aggregation pipelines and
concurrency scaling.

~~~
gianm
Hey Mani. Druid committer here. It actually is a column store! The project
makes a big deal about its ability to do indexes and pre-aggregation because
those are important capabilities and, while not unique, are also not
universally supported by every column store out there. So they are interesting
differentiators. But architecturally they are really just extra icing on the
cake.

Personally I see stuff like Druid, MemSQL, Clickhouse, Redshift, BigQuery, and
Snowflake as technological siblings in the space. These systems are all
evolving rapidly too (well, the healthy ones are anyway) so it's definitely a
good time to be an analytical database enthusiast.

With regard to the operational complexity, that's an interesting point. It
shows up in two main ways, I think -- the multi-process architecture and usage
of external deep storage. On huge clusters, which is what Druid was designed
for, the idea is that explicitly separating components in this way gives you
three benefits: they don't interfere with each other (spikes in ingestion load
won't interfere with ability to query historical data), you can scale each one
individually, and it makes most components "disposable" (as long as your
storage is reliable, the other Druid components can be blown away and
recreated without losing any data). It helps when you're trying to run a big
cluster in a stateless / containerized environment.

But these aspects are less good on small clusters or single servers, where it
just feels like a bunch of overhead. So we're currently working on simplifying
some of this for people that aren't running huge clusters.

We're also expanding SQL support rapidly. Almost every release adds additional
SQL capabilities. The next release is a big one, adding JOIN and GROUPING SETS
operators. The project's goal is to support it all before too long -- up next
after this release will likely be analytic functions.

If you're interested in checking out the community, we do meetups pretty often
(all virtual now, though, due to COVID-19). We're also planning our first user
conference later in the year @
[https://druidsummit.org/](https://druidsummit.org/).

~~~
manigandham
Hey Gian, I'm familiar with Druid since its start at Metamarkets (and was a
client of that company). I've been following Imply, and you guys have done
great work at making Druid a lot better over the years.

I guess I should've stated _relational_ columnstore to describe the others.
Vertica has S3/remote storage interfaces similar to Historicals and all
vendors are adding indexing to columnstore segments beyond partition/zone maps
for fast seeks. MemSQL is the most advanced with in-memory tables to augment
the disk-based columnstores.

The improved SQL support will help, and the overall design of Druid makes
sense, but I still find it tough to recommend over the alternatives now. If
everything's converging on similar functionality, what would you say is the
roadmap for Druid's future advantage?

~~~
gianm
Those are good questions.

IMO Druid is best differentiated when you want to power an online, real-time,
high-concurrency analytical application at scale. That is the use case Druid
was originally designed for, and it's still the one where the project shines
brightest. The reason mostly isn't the things database people usually talk
about (storage format, indexes, etc.). That stuff is important but isn't a
major differentiator between systems in today's world. The reason is more in
the pieces between servers: locking, replication, fault tolerance, data
partitioning and balancing, and resource management. Druid's approach to
these is fairly distinctive and gives it characteristics that let it do well
at powering these sorts of apps at scale. I think it will remain an important
advantage of Druid over other systems. Maybe one day the details would make a
good blog post :)

As far as the roadmap goes, most of the work we're doing to make Druid better
falls into two categories: first, stuff that makes it even better at this core
analytical app engine use case; second, stuff that better supports new use
cases, like the work on building out SQL. They are both important so usually
each release has a bit of both.

------
pachico
At this moment I cannot really understand why someone wouldn't use ClickHouse
for OLAP. Beyond my successful experience with it in production, I'm
currently testing it with data that for historical or commercial reasons has
always sat in BigQuery, and the results are really fantastic. I don't think
Druid can compare to it at all.

~~~
beagle3
As of two years ago, ClickHouse did not have as-of joins, which was a
dealbreaker for me.

(None of the usual suspects do either... I eventually used pandas+dask.)
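For readers unfamiliar with the term: an as-of join attaches, to each left-hand row, the most recent right-hand row at or before its timestamp - e.g., the prevailing quote for each trade. A minimal pure-Python sketch with binary search (pandas offers the same thing as `merge_asof`):

```python
import bisect

def asof_join(left, right):
    """For each (ts, value) row on the left, attach the value of the latest
    right-hand row whose timestamp is <= ts (None if no such row exists).
    Both inputs must be sorted by timestamp."""
    right_ts = [ts for ts, _ in right]
    out = []
    for ts, val in left:
        i = bisect.bisect_right(right_ts, ts) - 1   # latest right row at or before ts
        out.append((ts, val, right[i][1] if i >= 0 else None))
    return out
```

Example: joining trades against quotes picks the quote in effect at each trade time, e.g. `asof_join([(5, "sell")], [(0, 100.0), (4, 101.5)])` pairs the trade at t=5 with the quote from t=4.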

~~~
pachico
Lots of features have been implemented in the meantime, including ASOF JOIN.
[https://clickhouse.tech/docs/en/query_language/select/#select-join-strictness](https://clickhouse.tech/docs/en/query_language/select/#select-join-strictness)

