> There is some creative engineering going on here

Agreed. At a previous job, clickhouse outperformed timescale by several orders of magnitude, under just about every condition.

The timescale team seems to recognize that (look for the comment about clickhouse being a bulldozer) but they seem to say timescale can be better suited.

In my experience, in about 1% of the cases, yes, timescale will be a better choice (e.g., if you do very small batches of insertions, or if you need to remove some datapoints), but in 99% of the use cases for a time series database, clickhouse is the right answer.

There seem to have been several improvements to timescale since 2018, with columnar storage, compression, etc., and that's good because more competition is always better.

But in 2021, clickhouse vs timescale for a timeseries is like postgres vs mongo for a regular database: unless you have special constraints [*], the "cool" solution (timescale or mongo) is the wrong one.

[*]: you may think you have a unique problem and you need unique features, but odds are, YAGNI

https://en.wikipedia.org/wiki/You_aren%27t_gonna_need_it


> In my experience, in about 1% of the cases, yes, timescale will be a better choice (e.g., if you do very small batches of insertions, or if you need to remove some datapoints), but in 99% of the use cases for a time series database, clickhouse is the right answer.

I always find comments like this interesting :-). Different things are better for different use cases. If you find yourself inserting a lot of data in batches for OLAP-style analysis, then ClickHouse is a better choice today.

If you find yourself performing a lot of time-series related queries, and needing to build an application on top (e.g., where you might want the OLTP features of Postgres), then Timescale is the better choice.

YMMV! And that's OK :-)

> But in 2021, clickhouse vs timescale for a timeseries is like postgres vs mongo for a regular database: unless you have special constraints [*], the "cool" solution (timescale or mongo) is the wrong one.

This is also a funny statement, because TimescaleDB is built on PostgreSQL.

We actually take great pride in being a "boring" option [0] - in fact, I think TimescaleDB in many ways is more "boring" than ClickHouse (again, because of its PostgreSQL foundation). But I think that's actually a good thing - because you should want your database to be "boring", i.e., you shouldn't have to worry about it!

(Disclaimer: TimescaleDB co-founder)

[0] https://blog.timescale.com/blog/when-boring-is-awesome-build...


> This is also a funny statement, because TimescaleDB is built on PostgreSQL.

I know, but doing timeseries with postgres is "cool", not standard, not boring. I'd even say "risky".

> We actually take great pride in being a "boring" option

No, you're not there yet: doing timeseries with timescale is way riskier than with clickhouse, which is both a bit older (not much) and more mature (much more), while also being more widely used (even if you are doing a lot of outreach like these posts)


> No, you're not there yet: doing timeseries with timescale is way riskier than with clickhouse, which is both a bit older (not much) and more mature (much more), while also being more widely used

This is not true at all, and we explain why in the post:

1. TimescaleDB's reliability is PostgreSQL's reliability. ClickHouse has a lot of advantages, but "more reliable than PostgreSQL" is not one of them.

From the post:

  PostgreSQL has the benefit of 20+ years of development and usage, which has resulted in not just a reliable database, but also a broad spectrum of rigorously tested tools: streaming replication for high availability and read-only replicas, pg_dump and pg_restore for full database snapshots, pg_basebackup and log shipping / streaming for incremental backups and arbitrary point-in-time recovery, pgBackRest or WAL-E for continuous archiving to cloud storage, and robust COPY FROM and COPY TO tools for quickly importing/exporting data with a variety of formats. This enables PostgreSQL to offer a greater “peace of mind” - because all of the skeletons in the closet have already been found (and addressed).
2. ClickHouse, being a newer database, still has several "gotchas" with reliability: e.g., no data consistency in backups (because of its lack of transaction support and its asynchronous data modification).

From the post:

  One last aspect to consider as part of the ClickHouse architecture and its lack of support for transactions is that there is no data consistency in backups. As we've already shown, all data modification (even sharding across a cluster) is asynchronous, therefore the only way to ensure a consistent backup would be to stop all writes to the database and then make a backup. Data recovery struggles with the same limitation.

  The lack of transactions and data consistency also affects other features like materialized views because the server can't atomically update multiple tables at once. If something breaks during a multi-part insert to a table with materialized views, the end result is an inconsistent state of your data.

Now this trade-off - accepting less reliability for faster OLAP queries - may be fine with you. And that's OK. But stating that ClickHouse is more reliable than PostgreSQL/TimescaleDB is just not true.
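
For a concrete sense of the COPY tooling the quoted passage mentions, here is a minimal sketch using the psycopg2 Python client (the connection string and table names are my own illustrative assumptions, not anything from the post):

  import psycopg2

  # Illustrative connection string and table names; adjust for your own setup.
  conn = psycopg2.connect("dbname=metrics user=postgres host=localhost")

  with conn, conn.cursor() as cur:
      # Bulk-export a table to CSV via COPY TO.
      with open("cpu_metrics.csv", "w") as f:
          cur.copy_expert("COPY cpu_metrics TO STDOUT WITH CSV HEADER", f)

      # Bulk-import the file into another table via COPY FROM.
      with open("cpu_metrics.csv", "r") as f:
          cur.copy_expert("COPY cpu_metrics_copy FROM STDIN WITH CSV HEADER", f)

Because the import runs inside an ordinary PostgreSQL transaction, it either lands completely or not at all - which is the consistency property being contrasted with ClickHouse's asynchronous writes.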


You are right about transactional differences between ClickHouse and PostgreSQL but you are comparing apples and oranges. ClickHouse prioritizes speed, efficiency, and scale over consistency. These are reasonable choices, especially in the largely append-only use cases which dominate analytics.

1. I've seen relatively few messed up source tables and mat views over thousands of support cases. When they happen they can be bad for some use cases like financial analytics. They simply aren't very common. And for use cases like observability or log management it just doesn't matter to have a few lost or duplicated blocks over huge datasets.

2. ClickHouse overall is eventually consistent. There are generally differences between replicas when load is active, yet it causes relatively few practical problems in most applications as they load balance queries over replicas. Serialization is expensive and simply not very highly valued here.

3. ClickHouse uses mechanisms other than ACID transactions to ensure consistency. One good example is discarding duplicate blocks on insert into replicated tables. If there's any doubt whether an insert succeeded, you can just insert the block again. ClickHouse checks the hash and discards it. This is incredibly efficient and works without requiring expensive referential integrity (e.g., unique indexes). There's a small sketch of this retry pattern after this list.

4. It's just about always possible to get ClickHouse to boot even when you have corrupt underlying data (e.g., due to file system problems). I don't know how you define reliability but at least in this sense ClickHouse is extremely robust. I've never seen a server fail to start, though you might need a bit of surgery beforehand.

5. ClickHouse doesn't have transactional DDL. What it does have is features like altering tables to add new columns in a fraction of a second without locking regardless of the size of the dataset. Its behavior is close to NoSQL in this regard.
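
Here is a minimal sketch of the retry-and-dedup pattern from point 3, using the clickhouse-driver Python client (the table name and schema are hypothetical, and the dedup behaviour assumes identical blocks going into a Replicated* MergeTree table):

  from datetime import datetime
  from clickhouse_driver import Client  # assumed client; any ClickHouse client works

  client = Client("localhost")

  # One insert block against a hypothetical ReplicatedMergeTree table.
  rows = [(1, datetime(2021, 10, 26, 12, 0), 0.42)]

  # If we are unsure whether an earlier attempt landed, we simply send the
  # identical block again: the server hashes each block and silently drops
  # duplicates, so the retry is idempotent without unique indexes.
  for attempt in range(3):
      try:
          client.execute("INSERT INTO metrics_replicated (id, ts, value) VALUES", rows)
          break
      except Exception:
          continue  # safe to retry the very same block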

I could go on, but I think these points illustrate that ClickHouse has a different set of design choices for different problems. I would never use it for eCommerce, but it's great for analytics.

Disclaimer: I work on ClickHouse at Altinity.


> I could go on, but I think these points illustrate that ClickHouse has a different set of design choices for different problems. I would never use it for eCommerce, but it's great for analytics.

I agree with this. You are poking at a straw man.

My reply was in response to this comment by the OP:

> No, you're not there yet: doing timeseries with timescale is way riskier than with clickhouse, which is both a bit older (not much) and more mature (much more), while also being more widely used

TimescaleDB - which some don't realize - inherits all of the reliability of PostgreSQL, i.e., the 20+ years of usage and tuning (and broad tooling ecosystem).

What I was disproving was the statement that TimescaleDB/PostgreSQL was somehow riskier than ClickHouse.

ClickHouse is impressive, but deployments are still far behind that of PostgreSQL. ClickHouse is also younger and less mature than PostgreSQL.

I can see that you are the CEO of Altinity. Nice to meet you. I'm the CEO of Timescale. I think it's important that we strive for transparency in our industry, which includes admitting our own product's shortcomings, and to accept valid criticism.

We've done that many times in this HN thread (and in the blog post). I think we would have had a more productive discussion in this HN thread if ClickHouse developers were also as transparent with ClickHouse's shortcomings.

I'm happy to continue this conversation offline if you'd like. The database market is large, the journey is long, and in many ways companies like ours are fellow travelers. ajay (at) timescale (dot) com


Hi Ajay! Thanks for the thoughtful response and email. I would love a direct meeting and will contact you shortly.

I don't mean to gloss over ClickHouse imperfections. There are lots of them. For my money the biggest is that it still takes way too much expertise in ClickHouse for ordinary developers to use it effectively. Part of that is SQL compatibility, part of it is a lack of tools, of which simple backup is certainly one. To the extent that ClickHouse is risky, the risk is finding (and retaining) staff who can use it properly. Our business at Altinity exists in large part because of this risk, so I know it's real.

The big aha! experience for me has been that the things like lack of ACID transactions or weak backup mechanisms are not necessarily the biggest issues for most ClickHouse users. I came to ClickHouse from a long background in RDBMS and transactional replication. Things that would be game ending in that environment are not in analytic systems.

What's more interesting (mind-expanding even) is that techniques like deduplication of inserted blocks and async multi-master replication turn out to be just as important as ACID & backups to achieve reliable systems. Furthermore, services like Kafka that allow you to have DC-level logs are an essential part of building analytic applications that are reliable and performant at scale. We're learning about these mechanisms in the same way that IBM and others developed ACID transaction ideas in the 1970s--by solving problems in real systems. It's really fun to be part of it.

My comment didn't convey this clearly, for which I heartily apologize. I certainly don't intend to portray ClickHouse as perfect and still less to bash Timescale. I don't know enough about the latter to make any criticism worth reading.

p.s., Non-transactional insert (specifically non-atomicity across blocks and tables) is an undisputed problem. It's being fixed in https://github.com/ClickHouse/ClickHouse/issues/22086. Altinity and others are working on backups. Backup comes up in my job just about every day.



We were using tags, so that "else" block isn't the one being used for ClickHouse. Regardless, the table that is created (by the community and verified by former CH engineers) orders by created_at, not time, so that query should be the fastest DISTINCT possible.


(Post author)

I'm not sure why you think that's creative engineering. What you're pointing to is the depth of available configuration that the contributors to TSBS have exposed for each database. It's totally open source and anyone is welcome to add more configuration and options! I believe (although not totally sure) that Altinity and ClickHouse folks added their code a few years ago - at least it wasn't anyone on the Timescale team.

That said, we didn't actually use those scripts to run our tests. Please join us next Wednesday (10AM ET/4PM CET) to see how we set the databases up and ran the benchmarks. We'd be delighted to have you try it on your own too!


Ah, so the tests you used are not the ones in https://github.com/timescale/tsbs ?


All the same tests. You simply pointed to a shell script that's configurable to run tests for each database. We provided details in the blog post of exactly what settings we used for each database (cardinality, batch size, time range, TimescaleDB chunk size, etc.) so you can use those scripts to configure and run the tests too.
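
If it helps, here is a rough sketch of driving those TSBS binaries from Python (the flag names follow the TSBS README as I remember it - double-check them with --help for your checkout, and substitute the cardinality, batch size, and time range settings from the blog post for these illustrative values):

  import subprocess

  # Generate a small cpu-only dataset in TimescaleDB format.
  generate = [
      "tsbs_generate_data",
      "--use-case=cpu-only", "--seed=123", "--scale=4000",
      "--timestamp-start=2021-01-01T00:00:00Z",
      "--timestamp-end=2021-01-04T00:00:00Z",
      "--log-interval=10s", "--format=timescaledb",
  ]
  with open("timescaledb-data", "w") as out:
      subprocess.run(generate, stdout=out, check=True)

  # Load it with the matching loader (worker count and batch size are illustrative).
  load = ["tsbs_load_timescaledb", "--host=localhost", "--workers=8", "--batch-size=10000"]
  with open("timescaledb-data") as data:
      subprocess.run(load, stdin=data, check=True)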



