Show HN: A benchmark for analytical databases (Snowflake, Druid, Redshift) (clickhouse.com)
137 points by zX41ZdbW on July 13, 2022 | 57 comments
I created a web page to compare different analytical databases (both self-managed and services, open-source and proprietary) on a realistic dataset. It contains 20+ databases, each with installation and data loading scripts. And they can be compared to each other on a set of 43 queries, by data load time or by storage size.

There are switches to select different types of databases for comparison - for example, only MySQL compatible or PostgreSQL compatible.

If you play with the switches, many interesting details will be uncovered.

Full description: https://github.com/ClickHouse/ClickBench/blob/main/README.md




This is impressive work: it's time consuming to set up and benchmark so many different systems!

Impressiveness of the effort notwithstanding, I also want to encourage people to do their own research. As a database author myself (I work on Apache Druid) I have really mixed feelings about publishing benchmarks. They're fun, especially when you win. But I always want to caution people not to put too much stock in them. We published one a few months ago showing Druid being faster than Clickhouse (https://imply.io/blog/druid-nails-cost-efficiency-challenge-...) on a different workload, but we couldn't resist writing it in a tongue-in-cheek way that poked fun at the whole concept of published benchmarks. It just seems wrong to take them too seriously. I hope most readers took the closing message to heart: benchmarks are just one data point among many.

That's why I appreciate the comment "All Benchmarks Are Liars" on the "limitations" section of this benchmark -- something we can agree on :)


Benchmarks can be quite difficult to interpret when the test parameters vary between tests. However, I think the point of providing the open-source ClickBench benchmark [https://github.com/ClickHouse/ClickBench] is exactly to allow users to do their own research, by providing a standardized client and workload across any SQL-based DBMS. Standardized benchmarking is an important technique for comparing different systems, and also for comparing the same system across different environments (compute, storage, cloud vs. self-managed, etc.). SPEC [https://www.spec.org] used to do a great job of developing and releasing standardized benchmarks, although their activity has waned of late.


Any idea why Druid performed so poorly, though? 100x slower seems odd. I thought Druid was reasonably good at single-table analytics like in this benchmark. Is it the small data size?


It does seem odd, especially since in real world cases I'm more accustomed to seeing Druid and ClickHouse be in the same ballpark of performance. Sometimes one is somewhat faster than the other. But in my experience that's more like 2–3x difference in one direction or the other depending on workload, not 100x.

Hard to say why more of a difference shows up here, since I haven't analyzed the benchmark. It's possible the Druid configuration is suboptimal in some way. It's also possible it has something to do with the setup. It appears that the ClickHouse tests were done using a local table, which there isn't an equivalent of in Druid. Druid treats every table like what ClickHouse would call a "distributed" table. My understanding is using a distributed table in ClickHouse adds overhead since the system can no longer assume all data is on a single server. It may be that using distributed queries in both systems would yield a different result. And of course it may be that some of the test queries exercise functionality where ClickHouse is legitimately better optimized. But, again, hard to say anything for certain without detailed analysis.


In this benchmark, Druid was killed and restarted after every query because Druid seems to get into a degraded state otherwise. It is very, very likely that there is something wrong with the Druid setup here. It would have been useful to know basic details such as the Druid version.


You can see that the setup used is the one provided in the package: single-server/medium. It makes sense to improve the setup, but I recommend providing a better configuration by default. I think it is common courtesy that a system should just work in most cases without heavy tuning.


Many queries that did not run had aggregations over strings, like MIN/MAX. I don't know the specifics of why many Java-based DBMSs lack these aggregate functions.


Very well said!


It looks like the queries are all single table queries with group-bys and aggregates over a reasonably small data set (10s of GB)?

I'm sure some real workloads look like this, but I don't think it's a very good test case to show the strengths/weaknesses of an analytical database's query processor or query optimizer (no joins, unions, window functions, or complex query shapes?).

For example, if there were any queries with some complex joins Clickhouse would likely not do very well right now given its immature query optimizer (Clickhouse blogs always recommend denormalizing data into tables with many columns to avoid joins).


There are many limitations of this benchmark, indeed: https://github.com/ClickHouse/ClickBench/#limitations


There are several existing benchmarks that test query optimisers with a lot of joins. That does not show the performance of the query engine so much as how well your optimiser was tailored for those queries.


I think you're missing my point. The page is entitled "a Benchmark For Analytical DBMS", not "A Benchmark for Single Table Query Execution". Most analytical workloads are more complex than single-table queries.

I didn't say it wasn't useful to test single-table columnstore performance on a workload that runs best on single-host databases, just that this isn't the be-all and end-all of analytical database performance testing.


You are absolutely right. That's why this benchmark is named "a Benchmark For Analytical DBMS", not "the definitive benchmark for analytical DBMS".


There's a lot more involved in an execution engine running complex queries (beyond single-table group-bys) than just the query optimizer (though that is important). It includes things like join implementations and associated optimizations, shuffle performance (which matters even for single-table queries as you scale), etc.


> That does not show the performance of the query engine so much as how well your optimiser was tailored for those queries.

You can join just two large tables without leaving much room for the query optimizer.


But isn't that the main goal of analytical databases? They're not for data warehousing.


I’m somewhat convinced that the “difference” between OLAP and “data warehouses” is shady advertising.

Structurally they’re really similar. I suspect some vendors couldn’t match the outright performance of existing OLAP DBs, so they added extra features to differentiate themselves enough to justify a new product category, and then talk endlessly about how OLAP databases aren’t capable of handling this brave new future; even though for the majority of workloads, people would be better off just going with a “boring” OLAP database.

Large parts of this comment are directed pointedly at Snowflake.


I think it's more a matter of comparing minivans (cloud "DWH" engines) to sports cars (Clickhouse et al) here.

Snowflake's performance characteristics & ops paradigm have always been more consistent with managed Spark than anything else. Thus the competition with Databricks. They have only recently started pretending to be anything other than a low-maintenance batch processor with a nice managed storage abstraction, and their pricing model reinforces this.

That being said, it's currently pretty hard to find something that gives you:

- Bottomless storage
- Always "OK" performance
- Complete consistency without surprises (synchronous updates, cross-table transactions, snapshot isolation)
- The ability to happily chew through any size join and always return results
- Complete workload isolation

...all in one place, so people will probably be buying Snowflake credits for a few years yet.

I'm excited about the coming generation--cf. StarRocks and the Clickhouse roadmap--but the workloads and query patterns for OLAP and DWH only overlap due to marketing and the "I have a hammer" effect.

I don't think the slight misuse of either type of engine is bad at small-to-medium scale, either. It's healthy to make "get it done" stacks with fewer query engines, fewer integration points, and already-known system limitations.


I don't know where you draw the line between SQL analytics and SQL data warehousing.

I think your typical analytical workload definitely involves more data than this benchmark, though. Something like DuckDB is more ideal for a data set this small; 10s of GB of data can be analysed on a laptop - you don't need a full-fledged database server.


DuckDB is included in this benchmark. But there were many OOMs, and it was not easy to make it work: https://github.com/duckdb/duckdb/issues/3969

The data size is 75 GiB in uncompressed CSV and 13.7 GiB in Parquet.


A small suggestion: since the machines aren’t always comparable (e.g. Redshift), maybe it makes sense to allow the user to make a plot of ($/hr) vs query time? I realize this adds to the complexity, but most people in this space are making a trade-off between (maintainability/ease-of-use, long-running costs, performance), and plotting the latter two would probably highlight the Pareto frontier.
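
For what it's worth, here's a minimal sketch of that frontier computation (the systems and numbers are placeholders, not real benchmark results): sort by hourly price, and a system is Pareto-optimal if it's faster than everything cheaper.

  #include <algorithm>
  #include <iostream>
  #include <limits>
  #include <string>
  #include <vector>

  struct Result { std::string system; double dollars_per_hour; double query_seconds; };

  int main() {
      // Placeholder numbers, not real benchmark results.
      std::vector<Result> results = {
          {"system_a", 3.2, 1.4}, {"system_b", 1.1, 5.0}, {"system_c", 4.0, 2.0},
      };
      // Sort cheapest first; a point is on the frontier if nothing cheaper is also faster.
      std::sort(results.begin(), results.end(),
                [](const Result & a, const Result & b) { return a.dollars_per_hour < b.dollars_per_hour; });
      double best_time = std::numeric_limits<double>::max();
      for (const auto & r : results)
          if (r.query_seconds < best_time) {
              best_time = r.query_seconds;
              std::cout << r.system << " is on the Pareto frontier\n";
          }
  }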


Yes, this is what I want to include - the cost metric.

Although it will be tricky for systems with consumption-based pricing: BigQuery and Athena.

And Snowflake has almost consumption-based pricing - while you pay for warehouse time, it can quickly spin warehouses up and down based on load.

Maybe use some assumption like: the price for running queries on consumption-based systems is equalized to one hour of run time on other systems...
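
Roughly, the equalization would look like this (all numbers invented purely for illustration):

  // All numbers invented, purely to illustrate the equalization idea.
  double scanned_tb = 0.5;                      // data scanned by one full benchmark run (assumed)
  double price_per_tb = 5.0;                    // $/TB for a consumption-based system (assumed)
  double consumption_cost = scanned_tb * price_per_tb;   // $2.50 per run

  double hourly_rate = 4.0;                     // $/hr for a provisioned cluster (assumed)
  double provisioned_cost = hourly_rate * 1.0;  // normalized to one hour of run time: $4.00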


Yea, Redshift looks like it was run on an ra3.xlplus (4 cores) and is compared against ClickHouse on a c6a.4xl (16 cores). I suspect if this were normalized, Redshift would have an edge.


The Redshift configuration was selected to get 16 cores in total: 4 nodes of ra3.xlplus (4 vCPU each) for 16 vCPU in total.

https://github.com/ClickHouse/ClickBench/tree/main/redshift

We can add more results on different configurations easily.

There is also Redshift Serverless, which does not have any hardware configuration to tune.


That makes sense, thanks for clarifying. I got confused by the UI.


I've also rerun the benchmark on 4x ra3.4xlarge (48 cores in total) and added it to the comparison.


My company recently started to invest more into analytics and we had to find a good solution on which backend/database we want to decide. After some research we settled on ClickHouse and we couldn't be happier.

- Super easy to set up
- Easy to back up
- Fast configuration (documentation could be better in some parts)
- Similar SQL dialect to what our devs use in MySQL

Only negative points I could find so far:

- No 'good' management GUI such as phpMyAdmin, pgAdmin, or MySQL Workbench
- Caching/batching layers not directly implemented, only available through external software

As we are a fairly small company, all other analytical databases would have cost us a large amount of money/time more. Friends of ours recently hired a group of data engineers/analysts who brought all their AWS knowledge and tooling with them, which basically led them to the same outcome as us, while we have only 5% of their costs and all our devs are able to either ingest or query data.

EDIT: Does anyone have some recommendations on what GUIs I could give our PMs to work with ClickHouse instead of writing queries? All SaaS I found didn't support ClickHouse yet or would cost us a newborn. Also what tools do your devs use when they work with ClickHouse data?


I like a lot of things about Clickhouse, but one thing I'm afraid of is what happens when your data won't fit on a single machine. The replication and sharding seem pretty difficult, and from reading the documentation it feels like it might be pretty fragile.

I think once you reach that scale, systems that completely separate data and compute (like Snowflake or Trino+S3) are much less of a pain to run, since even if you completely blow up your compute, the data stays.

> Does anyone have some recommendations on what GUIs I could give our PMs to work with ClickHouse instead of writing queries? All SaaS I found didn't support ClickHouse yet or would cost us a newborn. Also what tools do your devs use when they work with ClickHouse data?

I think both Superset and Metabase were interesting choices - if you want to save money (at the expense of engineering time) you can self-host them.


> I like a lot of things about Clickhouse but one thing I'm afraid of is what happens when your data won't fit on a single machine.

ClickHouse sharding and replication is not that hard to master--it's simple and the parts are visible. If you don't want the headache of distributed system management, run it in Kubernetes or a managed service. ClickHouse-as-a-service is widely available from multiple vendors.

ClickHouse is also fast and cost-efficient at scale. It's a very good fit for multi-tenant SaaS analytics where you need fixed latency on responses to users.

Disclaimer: I work for Altinity, who run a cloud platform for ClickHouse.


eBay wrote a great blog post about clustering Clickhouse:

https://tech.ebayinc.com/engineering/ou-online-analytical-pr...


> The replication and sharding seem pretty difficult, and from reading the documentation it feels like it might be pretty fragile.

That's why trip.com replaced Clickhouse with StarRocks, a free (open source) OLAP database that handles sharding beautifully to give you linear scalability, and better query performance for single table or joined queries.


We built a Managed ClickHouse service to help with exactly these difficulties. We handle sharding, clustering, ZooKeeper, patching, updates without downtime, and hybrid storage based on S3. https://double.cloud


How many Clickhouse as a service offerings exist now? I stopped counting at 7 a few months ago (double.cloud was not on my list).


Could you share a list of them?


  - Firebolt (Hard fork of clickhouse)
  - Altinity
  - Gigapipe
  - Hydrolix
  - Bytehouse.cloud
  - https://clickhouse.com/ ("coming soon")
  - TiDB (Their columnstore is a fork of clickhouse)
I stopped tracking after this. I saw a few press releases go by announcing a few others as well, which I've since lost track of.

The official Clickhouse Inc. is surely going to be under pressure to pull features out of their open source offering over time to differentiate themselves.


ClickHouse can run in a classic shared-nothing setup and in "cloud-native" setup with shared storage. Setting up a distributed system with hundreds of machines can be difficult... but it's actually not more difficult than for any other distributed system at this scale.


Just in case, here's built-in batching https://clickhouse.com/docs/en/operations/settings/settings/... and a list of mature UIs https://clickhouse.com/docs/en/connect-a-ui (that all have SaaS offerings I believe) and all third-party UIs https://clickhouse.com/docs/en/interfaces/third-party/gui


Thanks!

I remember trying the built-in batching but we had some trouble with it, so we switched to <https://github.com/nikepan/clickhouse-bulk>, which has worked without any issue since then.

Will have a look at the UIs listed there


Tableau is second to none for dataviz imo. Much more versatile than the competition. You can do interactive visualizations, set thresholds, color code/label data, etc. Of course it costs a lot, but if dataviz is something that is delivering real value the cost is more than worth it.


Clickhouse support for TablePlus landed 2 months ago. https://github.com/TablePlus/TablePlus/issues/670


Anyone got a succinct explanation for why ClickHouse is doing so well in this kind of benchmark? What are the key tricks they're using to get great analytical performance, and why haven't those same tricks been adopted by other competing systems?


I see that other systems often forget many small details.

Imagine you are implementing the COUNT(DISTINCT) function. And imagine it is used with GROUP BY something, like GROUP BY region_id. On realistic datasets, you will have a large number of different region_id values and something like a power-law distribution of their frequencies: https://en.wikipedia.org/wiki/Zipf%27s_law

Then you can expect that most of the values of COUNT(DISTINCT) will be small. And the natural optimization will be: use a linear array before switching to a hash table. Or: place smaller hash tables in something like a memory pool.
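
A minimal sketch of the linear-array idea (ClickHouse's real implementation is more elaborate, and the threshold below is invented):

  #include <algorithm>
  #include <cstddef>
  #include <cstdint>
  #include <unordered_set>
  #include <vector>

  // Stays in a small array (linear scans) while the set is tiny -- the common
  // case under a Zipf-like distribution -- and upgrades to a hash table later.
  class AdaptiveDistinctCounter {
      static constexpr std::size_t kSmallLimit = 16;  // invented threshold
      std::vector<uint64_t> small_;                   // linear array for the tiny case
      std::unordered_set<uint64_t> large_;            // hash table once the set outgrows it

  public:
      void insert(uint64_t value) {
          if (!large_.empty()) { large_.insert(value); return; }
          if (std::find(small_.begin(), small_.end(), value) != small_.end()) return;
          small_.push_back(value);
          if (small_.size() > kSmallLimit)            // promote to the hash table
              large_.insert(small_.begin(), small_.end());
      }

      std::size_t count() const { return large_.empty() ? small_.size() : large_.size(); }
  };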

It comes down to a ton of optimizations for the distributions of real datasets.


I would say deep optimizations in every aspect of the data life cycle. Everything from storing to querying data is heavily optimized. I recall an online meetup where a 10 ms gain after a patch was hugely celebrated. That kind of thing compounds over and over and makes a really fast data platform for these use cases.


Seems like a great ad for c6a.metal!

Without dollars as part of the analysis, I don't find these comparisons useful.


It's curious that there are no c6a.metal instances available in eu-west-2 today; I would say last week they were available in this region.


Can BigQuery be added? I'm interested to know how it compares in these sorts of scenarios.

We use BQ because I do NOT want to have to deal w/management of any of the infrastructure; paying by query is convenient for our small team. But if there is true value that can be realized by something like ClickHouse, I want to see how it compares.


I've added BigQuery, run the benchmark, and got the results... but due to their ToS, the benchmark results cannot be published. See https://cube.dev/blog/dewitt-clause-or-can-you-benchmark-a-d...

The benchmark instructions for BigQuery are here: https://github.com/ClickHouse/ClickBench/tree/main/bigquery

If you follow them, just add the .json with the results and regenerate the HTML with the generate-results.sh script. You will get a version of the benchmark page with BigQuery included - for your own use.

I wish more vendors would get rid of the DeWitt Clause sooner. For example, Snowflake removed it a few months ago (maybe due to pressure from Databricks), and SingleStore also removed the clause recently.


Much appreciated.


Wow Snowflake absolutely crushes these benchmarks. Anyone know why? I don't know much about this space.


The Snowflake clusters are several times bigger (and more expensive) than almost all of the comparison points.


README.md was interesting to read. I enjoyed learning about the history of the effort and I appreciate how much work got it to this point.

I enjoy both benchmarking and benchmarketing. Although benchmarketing is more fun when it isn't about the DBMS I am working on.


This is great! Will it be possible to include Databricks in the benchmark?


Databricks is in the list of systems I want to add: https://github.com/ClickHouse/ClickBench/#systems-included


Yeah, a comparison with SparkSQL on Kubernetes would also be interesting! Let's get in touch; I may be able to help with that.


Do you have any benchmark results with Clickhouse in a clustered configuration (and other DBs)? Is Clickhouse expected to be run only in single-node deployments?


ClickHouse is typically run on a cluster. There are setups with over 1000 machines and over 100 PB of data.

But for this benchmark, the simplest setup was selected. I can add a clustered setup for ClickHouse to this benchmark.


[dead]


> Regarding the joined table queries that are missing in the tests

Looks like a huge issue with this benchmark.



