how does Hydra compare to Citus? https://www.citusdata.com

jerrysievert · on Aug 3, 2023

Generally faster across the board. A lot of work was done to expand it and speed it up, plus support for updates, deletes, and vacuuming.

https://benchmark.clickhouse.com/#eyJzeXN0ZW0iOnsiQXRoZW5hIC...

rubiquity · on Aug 3, 2023

Since benchmarks can be misleading I want to point out that the differences between Hydra and the "tuned"[0] PostgreSQL (which are some very basic settings) are a lot less convincing, with plain old PG coming ahead on quite a few: https://tinyurl.com/eju9tht2

I also noticed quite a bit of parity between Hydra and Citus on data set size. Is Hydra a fork of Citus columnar storage?

0 - https://github.com/ClickHouse/ClickBench/blob/main/postgresq...

arp242 · on Aug 3, 2023

> plain old PG coming ahead on quite a few

I found that is common among these types of databases (e.g. Citus, Timescale, etc.) which perform well under very specific conditions, and worse for many (most?) other things, sometimes significantly worse.

That said, Hydra does take up ~17.5G for that benchmark and "PostgreSQL tuned" about 120G, the insert time is ~9 times faster, and "cold run" is quite a bit faster too. It's only "hot run" that shows a fairly small difference. I think it's fair to say Hydra "wins" that benchmark.

> Is Hydra a fork of Citus columnar storage?

Yes: "Hydra Columnar is a fork from Citus Columnar c. April, 2022".
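To put the quoted sizes side by side, here is a quick back-of-the-envelope calculation. The figures are the approximate ones quoted above, and the variable names are only for illustration.

```python
# Approximate on-disk sizes quoted in this thread for the benchmark, in GB.
hydra_gb = 17.5
pg_tuned_gb = 120.0

# How many times larger the "PostgreSQL tuned" data set is than Hydra's.
size_ratio = pg_tuned_gb / hydra_gb
print(f"PostgreSQL tuned is about {size_ratio:.1f}x larger on disk")
```

This says nothing about where the difference comes from (compression, indexes, or storage layout); it only restates the two quoted numbers as a ratio.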

riku_iki · on Aug 3, 2023

> Hydra does take up ~17.5G for that benchmark and "PostgreSQL tuned" about 120G

you can run pg on a compressed filesystem

arp242 · on Aug 3, 2023

I'm sure you can, but AFAIK neither uses compression in that benchmark so it's a fair comparison. Even if filesystem compression reduced that to 17.5G (doubtful), it won't be free in terms of CPU cycles, and no matter what it's still ~120G to load into memory, scan, update, etc.

riku_iki · on Aug 3, 2023

my bet is that hydra already uses compression internally, otherwise it is hard to explain where the difference comes from.

> it won't be free in terms of CPU cycles

it can reduce IO traffic significantly, and it can be a very positive trade-off depending on circumstances.

arp242 · on Aug 3, 2023

I had assumed that PostgreSQL is so much larger because it creates heaps of indexes (which is probably also why inserts are so much slower for it), but I don't really have a good way to confirm that quickly.

riku_iki · on Aug 3, 2023

one can choose to not create "heaps of indexes".

arp242 · on Aug 3, 2023

At which point your performance will drop like a brick for these types of queries – I'm pretty sure these indexes weren't added for the craic.

riku_iki · on Aug 3, 2023

it depends on your query obviously.

In general, I did very deep benchmarking of pg, clickhouse and duckdb, and I sure didn't make stupid mistakes like this: https://news.ycombinator.com/item?id=36990831

My dataset has 50B rows and 2 TB of data. I think columnar dbs are very overhyped, and I chose pg because:

- pg performance is acceptable, maybe 2-5x times slower than clickhouse and duckdb on some queries if pg is configured correctly and run on compressed storage

- clickhouse and duckdb start falling apart very fast because they are specialized for a very narrow type of query: https://github.com/ClickHouse/ClickHouse/issues/47520 https://github.com/ClickHouse/ClickHouse/issues/47521 https://github.com/duckdb/duckdb/discussions/6696

arp242 · on Aug 4, 2023

"2-5x times slower" can mean the difference from 2 seconds to 4 to 10 seconds. Two seconds is still (barely) acceptable for interactive usage, ten seconds: not so much. You're also going to need less beefy servers, or fewer servers.

I also "just" use PostgreSQL for all of this by the way, but the limitations are pretty obvious. You're much more limited in what you can query with good performance, unless you start creating tons of queries or pre-computed data and such, which have their own trade-offs. Columnar DBs are "overhyped" in the sense that everything in programming seems to be, but they do exist for good reasons (the reasons I don't use one are that they also come with their own set of downsides, as well as just plain laziness).
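As a rough illustration of the 2-5x range discussed above, assuming a constant slowdown factor applied to a 2-second baseline (the helper name is hypothetical):

```python
def slowed_latency(base_seconds: float, factor: float) -> float:
    """Return the latency after a constant-factor slowdown."""
    return base_seconds * factor


base = 2.0  # seconds; the baseline used in the example above
for factor in (2, 5):  # the two ends of the quoted range
    print(f"{factor}x slower: {slowed_latency(base, factor):.0f} s")
```

Real queries rarely slow down by one uniform factor, so this is only the arithmetic behind the "2 seconds vs. 4 to 10 seconds" comparison.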

zX41ZdbW · on Aug 3, 2023

ClickHouse can do large GROUP BY queries, not limited by memory: https://clickhouse.com/docs/en/sql-reference/statements/sele...

riku_iki · on Aug 3, 2023

as explained in https://github.com/ClickHouse/ClickHouse/issues/47521#issuec... it can't; that parameter only applies to the pre-aggregation phase, not to the aggregation itself.

Feature request is not implemented yet: https://github.com/ClickHouse/ClickHouse/issues/40588

zX41ZdbW · on Aug 4, 2023

ClickHouse uses "grace hash" GROUP BY with the number of buckets = 256.

It can handle data about 256 times larger than memory because only one bucket has to be in memory while merging. It works for distributed query processing as well and is enabled by default.

About the linked issue - it looks like it is related to some extra optimization on top of what already exists.
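For readers unfamiliar with the general idea, here is a minimal sketch of a grace-hash-style GROUP BY count. This is a textbook simplification and not ClickHouse's actual implementation: the function name, the on-disk format (pickle), and the hash (CRC32) are all assumptions made for illustration. Keys are partitioned into bucket files by hash, then each bucket is aggregated on its own, so only one bucket's hash table is in memory at a time.

```python
import os
import pickle
import tempfile
import zlib
from collections import Counter
from typing import Iterable, Iterator, Tuple


def grace_hash_count(keys: Iterable[str], num_buckets: int = 256) -> Iterator[Tuple[str, int]]:
    """Yield (key, count) pairs, like SELECT key, count(*) ... GROUP BY key."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, f"bucket_{i}.bin") for i in range(num_buckets)]
        files = [open(p, "wb") for p in paths]
        try:
            # Phase 1: spill every key to the bucket chosen by its hash.
            # Equal keys always land in the same bucket.
            for key in keys:
                bucket = zlib.crc32(key.encode("utf-8")) % num_buckets
                pickle.dump(key, files[bucket])
        finally:
            for f in files:
                f.close()

        # Phase 2: aggregate one bucket at a time. Buckets share no keys,
        # so each bucket's result is final and can be emitted immediately.
        for path in paths:
            counts: Counter = Counter()
            with open(path, "rb") as f:
                while True:
                    try:
                        counts[pickle.load(f)] += 1
                    except EOFError:
                        break
            yield from counts.items()


if __name__ == "__main__":
    sample = ["a", "b", "a", "c", "b", "a"]
    print(sorted(grace_hash_count(sample, num_buckets=4)))
```

In this sketch peak memory is bounded by the largest single bucket rather than by the whole key set, which is where the "about 256 times larger than memory" intuition comes from when keys hash evenly. A heavily skewed or multi-threaded workload would change that bound.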

riku_iki · on Aug 4, 2023

> only one bucket has to be in memory while merging.

it's hard for me to judge implementation details, but per that person's reply, memory is also multiplied by the number of threads doing the aggregation.

benn0 · on Aug 3, 2023

Do you happen to have any documentation about your benchmarking? I'm also considering these options at the moment (currently using pg+timescaledb) and am interested in what you found.

riku_iki · on Aug 3, 2023

I don't have documentation.

I just created large tables, tried to join, group by, and sort them in pg, clickhouse, and duckdb, looked at what failed or was slow, and tried to resolve it.

I am happy to answer specific questions, but I didn't use timescaledb.

riku_iki · on Aug 3, 2023

> 0 - https://github.com/ClickHouse/ClickBench/blob/main/postgresq...

that postgres config is very underpowered: it has only 8 workers per gather while the machine has 192 vcpus.

re-thc · on Aug 4, 2023

Submit a PR?

riku_iki · on Aug 4, 2023

I am not sure what the process is. Who will rerun the benchmarks then?

adr1an · on Aug 3, 2023

Right when I was thinking URL shorteners were out of fashion... /S

biugbkifcjk · on Aug 3, 2023

It's just there to make it easier for mobile users to click it.

setr · on Aug 3, 2023

I don’t see why the GitHub link is any harder to click than the tiny url link in that post.

I’m pretty sure the only reason URL shorteners exist is Twitter's character limits (and software that doesn’t visually hide egregiously long URLs), but they continue to be used outside of those places due to cargo culting.