
Apache Druid is a pretty amazing tool, with one assumption: your data has an event timestamp as a crucial part of the ingest, and it receives no updates at all.

My run-ins with Vertica for BI/PM metrics data are almost a decade old, but it is a bit more powerful in some ways, for instance in how it does projections + distributions.

The most common queries Vertica got hit with were unique-user workloads, which involved intersections: there was a single table being ingested, but 3 projections. One was partitioned by user, one by (user, property), and one by (user, property, date).

The biggest dimension table was the A/B experiment id allocation list, which was duplicated on every single host.

A better storage model for this would be something like a Replex. [1]

Druid can be used for the same sort of workload at high scale (i.e. millions of users), where a best-effort distinct count is as good as the real thing, but much faster.
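Druid does this with HyperLogLog-style sketches under the hood; a simpler way to see why a bounded sketch can stand in for an exact distinct count is a K-Minimum-Values estimator. This is a minimal Python sketch of the idea, not Druid's actual implementation:

```python
import hashlib
import heapq

def approx_distinct(items, k=1024):
    """K-Minimum-Values sketch: keep the k smallest normalized hash
    values; if the k-th smallest is v, roughly (k - 1) / v distinct
    items were seen. Memory is O(k) regardless of input size."""
    heap = []        # max-heap via negation: the k smallest hashes so far
    in_heap = set()  # hash values currently held, to skip duplicates
    for item in items:
        digest = hashlib.blake2b(str(item).encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big") / 2**64  # uniform in [0, 1)
        if h in in_heap:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            in_heap.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)
            in_heap.discard(evicted)
            in_heap.add(h)
    if len(heap) < k:   # fewer than k distinct hashes seen: count is exact
        return len(heap)
    return int((k - 1) / -heap[0])
```

With k=1024 the relative error is around 3%, which is exactly the "pixel is bigger than the error bar" regime described below.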

If I had to do this today, I would also use the BloomKFilter in Apache Druid for the experiment membership queries, which would work better for approximate queries than anything built to generate accurate results (& store the dimension table in a slowly-changing-dimension store).
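For readers unfamiliar with the structure behind BloomKFilter: a Bloom filter answers "is this key possibly in the set?" with no false negatives and a tunable false-positive rate, which is exactly the trade-off you want for experiment membership. A minimal Python illustration of the data structure (not Druid's Java implementation):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: set k bit positions per key. Membership
    tests may return false positives, but never false negatives."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.m = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        # Derive k independent positions by salting the hash.
        for i in range(self.k):
            digest = hashlib.blake2b(key.encode(), digest_size=8,
                                     salt=bytes([i])).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p >> 3] |= 1 << (p & 7)

    def might_contain(self, key):
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(key))
```

A filter for an experiment's user allocation fits in a few hundred KB and can be shipped to every query node, instead of joining against the duplicated dimension table.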

The real power of Druid is pushing the segments to S3 + being able to rehydrate off Kafka, so it can survive total local data loss without being very expensive with EBS (i.e. by downloading segments to ephemeral SSDs), while answering dashboard queries where a pixel is bigger than the error bar on these approximations.

Plus the immutability of the data means you can maintain a partial-results cache at segment granularity rather than recomputing for every refresh of the dashboard.
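The caching idea above can be sketched in a few lines. Because a published segment never changes, a (segment, metric) cache entry never needs invalidation; a dashboard refresh only recomputes over segments it hasn't seen before. All names here are hypothetical, purely to illustrate the pattern:

```python
# Cache keyed by (segment_id, metric). Segments are immutable once
# published, so entries are never invalidated, only added.
segment_cache = {}

def segment_aggregate(segment_id, rows, metric):
    """Compute (or fetch) the aggregate for one immutable segment."""
    key = (segment_id, metric)
    if key not in segment_cache:
        segment_cache[key] = sum(row[metric] for row in rows)
    return segment_cache[key]

def dashboard_total(segments, metric):
    """segments: {segment_id: rows}. On each refresh, everything but
    the newest segments (e.g. the current hour) is a cache hit."""
    return sum(segment_aggregate(sid, rows, metric)
               for sid, rows in segments.items())
```

On a dashboard refreshing every minute, only the segment covering the current time interval ever misses the cache.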

Picking up this problem today in a web-scale environment, I would pick Druid for experiment data streams and define rollup aggregates ahead of time (over, say, ClickHouse); but as the data gets more mutable and less time-ordered, other tools like Apache Kudu look better at the storage layer.
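"Rollup aggregates ahead of time" means collapsing raw events that share a time bucket and dimension values into one pre-aggregated row at ingestion, which is what makes the query-time scans cheap. A rough Python sketch of the idea (field names are hypothetical, and real ingestion-time rollup is configured declaratively, not hand-coded):

```python
from collections import defaultdict

def hourly_rollup(events):
    """Collapse raw events sharing an hour bucket and dimension values
    into one pre-aggregated row: the essence of ingestion-time rollup."""
    buckets = defaultdict(lambda: {"count": 0, "value_sum": 0})
    for ev in events:
        # Truncate an ISO-8601 timestamp to the hour:
        # "2023-05-01T17:42:00" -> "2023-05-01T17"
        hour = ev["timestamp"][:13]
        key = (hour, ev["experiment_id"], ev["variant"])
        buckets[key]["count"] += 1
        buckets[key]["value_sum"] += ev["value"]
    return dict(buckets)
```

The catch, and why mutable or out-of-order data pushes you toward something like Kudu: once rows are rolled up, individual events can no longer be corrected or deleted.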

[1] https://blog.acolyer.org/2016/10/27/replex-a-scalable-highly...



The Procella [1] paper took a different approach to experiments. They embedded an experiment-ID array in table rows and indexed the rows by experiment ID with a postings list.

Replex looks really neat. I've only skimmed the Acolyer summary so far. What's the difference between a replex and multiple projections of data with different partitions and sort-orders used in C-store and Vertica?

[1]: http://www.vldb.org/pvldb/vol12/p2022-chattopadhyay.pdf


> What's the difference between a replex and multiple projections of data with different partitions and sort-orders used in C-store and Vertica?

The 3 replicas kept for failure tolerance are reused, so the first 3 ordering projections don't add storage cost to the system.

Also, the paper doesn't mention it, but rebuild traffic is better distributed on failure too: since each replex is partitioned differently, the loss of one replica triggers a rebuild that draws from a wider set of machines rather than a single one.



