I'll answer this here with a similar response that I gave Pradeep (the author) via Twitter.
I think ClickHouse is a great technology. It totally beats TimescaleDB for OLAP queries. I'll be the first to admit that.
What our benchmark (a 100+ hour, 3-month analysis) showed is that for _time-series workloads_, TimescaleDB fared better.
Pradeep's analysis - while earnest - essentially compares OLAP-style queries using a dataset that is not very representative of time-series workloads. This is why the Time Series Benchmark Suite (TSBS) exists (which we did not create, although we now maintain it). I've asked Pradeep to compare using the TSBS - and he said he'd look into it.
As a developer, I'm very wary of technologies that claim to be better at everything - especially those who hide their weaknesses. We don't do that at TimescaleDB. For those who read our benchmark closely, we clearly show where ClickHouse beats TimescaleDB, and where TimescaleDB does better. And - despite what many commenters on here may want you to think - we heap loads of praise on ClickHouse.
As a reader of HackerNews, I'm also tired of all the negativity that's developing on this site. People who bully. People who default to accusing others of dishonesty instead of trying to have a meaningful dialogue and reach mutual understanding. People who enter debates wanting to be right, versus wanting to identify the right answer. Disappointingly, this includes some visible influencers whom I personally know. We should all strive to do better, to assume positive intent, and have productive dialogues.
(This is why one of our values at TimescaleDB is "Assume Positive Intent."  I think Hacker News - and the world in general - would be a much better, happier, healthier place if we all just did that.)
The results which TimescaleDB showed seem to demonstrate that it is better than ClickHouse on the TSBS benchmark (or a particular configuration of it), not for time-series workloads in general.
In my experience, "time series" workloads can be defined very broadly (by a casual user), and querying a log of events is often seen as such.
So while we can debate on an academic level what a "time-series" workload is, if we look at the facts we will find that the answer is far more specific than you may think.
Also, Peter, I wonder if you should be more forthcoming with your ClickHouse affiliation.
Everyone reading this thread is aware of my bias because I clearly state my TimescaleDB affiliation. But I didn't realize until very recently (when someone pointed this out to me) that you are affiliated with ClickHouse - e.g. perhaps as an investor in, or even a founder of, Altinity?
It is best practice on Hacker News to be forthcoming with affiliations so that readers can make their own decisions on how to correct for any natural biases made by commenters.
Even looking at your benchmark queries, I'm confused what value it provides over a standard OLTP or OLAP setup.
Often there are other time-related stuff in there as well, but I think the vast majority of use is fast calculation of "mean/max/first value of sensor(s) for X-second buckets".
I understand the "spirit" of your comment and I agree in general.
However, something unique that comes to mind about some TSDBs is automated aggregation of data points (using min/max/avg/whatever functions) into something less granular once a metric's data becomes older than X.
With a TSDB, for example, you'll be able to look at data points with a max resolution of 10 seconds for metrics collected during the past 7 days, but after that they'll be aggregated into e.g. 60-second intervals, and so on.
I think the theory behind it is that you're probably interested in the details for recent stuff, but the older the stuff gets, the more you just want to look at the general trend without caring about the details. I can agree with that. In the end, by doing these kinds of aggregations, everything should in theory be faster & use less storage (as older data has fewer data points).
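As a rough illustration (plain Python, not any particular TSDB's API), the rollup described above boils down to grouping points into fixed-width time buckets and keeping only min/max/avg per bucket:

```python
from collections import defaultdict

def downsample(points, bucket_seconds=60):
    """Aggregate (timestamp, value) pairs into coarser buckets,
    keeping min/max/avg per bucket -- the same kind of rollup a
    TSDB applies once data is older than its retention threshold."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return {
        bucket: {
            "min": min(vals),
            "max": max(vals),
            "avg": sum(vals) / len(vals),
        }
        for bucket, vals in sorted(buckets.items())
    }

points = [(0, 1.0), (10, 3.0), (59, 2.0), (60, 8.0)]
print(downsample(points))
# bucket 0: min 1.0, max 3.0, avg 2.0; bucket 60: all 8.0
```

Three raw points collapse into one aggregated row for the first minute, which is where both the storage savings and the query speedup come from.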
This is how "Graphite/Carbon" ( https://graphite.readthedocs.io/en/latest/faq.html ) works.
I did use Graphite/Carbon for some years, but I didn't much like its architecture and had some performance problems => I've replaced it with Clickhouse (I'm not doing any kind of data aggregation) and that uses less space and is quicker (it also uses a lot less CPU & I/O) :)
The main requirements for time series databases:
- Fast data ingestion (millions of rows per second).
- Good compression for the stored data, since the amount of time-series data is usually huge (trillions of rows per node). Compression may also improve query speed, since it reduces the amount of data that needs to be read from disk during heavy queries.
- Fast search for time series with the given labels. For instance, search for temperature measurements across millions of sensors in a given country.
- Fast processing of rows for the found time series over the given time range. The number of rows to process often exceeds hundreds of millions per query.
Typical OLTP databases cannot meet these requirements.
I would not want to try this on a traditional relational database kernel, they are not designed for workloads that look like this. They optimize their tradeoffs for slower and more complicated transactions.
People are cross-shopping Clickhouse/TimescaleDB, rightly or wrongly, and it's not clear to the community when they should use which: what overlaps on the Venn diagram and what doesn't, and where they do overlap, why I would go one way or the other.
You have to do a better job of showing how you're solving customer problems. Benchmarks are next to useless, an unsolvable problem, I wouldn't waste time on it. If customers are succeeding on your platform, you'll succeed.
The main advantage, from my perspective, is that you can query across business data and time-series data with all the advantages that Postgres has. Time-series data, while useful on its own, becomes incredibly powerful when it can be combined with your business and production data.
A great example is our outbound network data monitoring. We use pmacct http://www.pmacct.net/ to send network flows from our firewall to Postgres, host inventory data in Postgres, and a foreign data wrapper around our LDAP data to determine user/host assignment, and from that we can correlate every data flow to the user who is assigned to the host that generated that particular flow. This makes for some pretty powerful security reporting. Outside of that, we use Timescale's hypertables in a number of places that aren't explicitly timeseries data, like syslog data, web server logs, etc. This allows for some pretty amazing reporting on log data that is timeboxed, like "give me all the 500 errors from our HTTP log that have an IP address in Finland (did I mention that we load GeoIP data into Postgres every night?) in the last 3.5 hours."
Timescale is excellent on its own, and honestly competitive with other TSDBs on its own. Having access to the full Postgres ecosystem with your timeseries data puts Timescale way ahead of everyone else. My story might change when I hit the limits of what a single Postgres host can ingest, but I'm not even close to that scale yet.
Another advantage of Timescale is having access to real SQL: you don't have to learn a new domain-specific query language, you can just use SQL. This admittedly can be a double-edged sword. SQL is more complicated than PromQL/InfluxQL, but that comes with quite a lot of extra capability, and the ability to transfer that knowledge into other domains.
I personally really like Timescale, and feel that regardless of anyone's benchmarks, no matter how well thought out or not, the advantages outweigh the disadvantages by a pretty large margin.
The article mentioned flaws of Clickhouse which, in my opinion, are irrelevant in the context of a TSDB (e.g. "no transactions", "inability to modify data at a high rate", etc.). I'm saying this because in my mind I associate TSDBs with "server metrics collection", where it's no big deal even if some/many datapoints are lost, there is usually no need to modify that data, and so on. But I might be wrong; maybe the use cases you have in mind are different (e.g. transactional accounting data?).
About deleting data: tables that host timeseries data in Clickhouse are usually partitioned by the time unit (day/month/year) that has to be deleted later => dropping one or multiple such partitions is easy & fast & extremely light on the system (it just gets rid of the underlying files and directories).
The article didn't directly show the SQL for how the tables were defined nor how the tests were performed; you linked your benchmark suite, but I honestly don't want to dig into that as it seems to be complex => to be honest it sounds like something engineered to be better than your antagonist (even if maybe it's not, dunno).
In general Clickhouse has many knobs & levers that can be changed/tweaked which can backfire if not set appropriately, I personally think that whoever uses Clickhouse MUST understand it (I did not at the beginning => got totally screwed up), but at the same time those many knobs & levers provide a lot of flexibility. Btw. indirectly they seem like a "filter" to ensure that only people that are able to use that DB will end up using it :P
So, summarized, I raised my eyebrows a couple of times while reading. E.g. Clickhouse does "merges" in the background, and they can queue up (depending on a lot of things), and all of that can heavily stress the disks & CPU, so I have no clue what was going on when you got your 156% performance advantage over CH. Maybe you're right, maybe you're not, I just don't know, so I didn't trust that article then, nor do I now.
Maybe you could be right if you pointed to the "reliability" of your DB? E.g. because the "merges" that Clickhouse triggers at "unknown" intervals in the background can create CPU & disk hotspots on the host(s) and therefore have negative repercussions on many insert/query ops, your TimescaleDB could definitely have an advantage here (if it doesn't perform deferred maintenance like CH does), but if that's true, then in my opinion it was lost in the article.
That being said, ClickHouse also has a ton of clever levers you can pull to squeeze out better performance and compression which aren't used by default, such as Delta/DoubleDelta CODECs with LZ4/ZSTD compression, etc. Not to mention MATERIALIZED VIEWs and/or the relatively newer MergeTree projections feature.
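For intuition on why those CODECs help (a toy sketch in plain Python, not ClickHouse's actual implementation): Delta stores differences between consecutive values, and applying it twice (the DoubleDelta idea) turns regularly spaced timestamps into runs of zeros that LZ4/ZSTD then compress almost for free:

```python
def delta(seq):
    # Keep the first value, then store differences between neighbours.
    return seq[:1] + [b - a for a, b in zip(seq, seq[1:])]

def double_delta(seq):
    # Naive delta-of-delta: just apply delta twice.
    return delta(delta(seq))

# Timestamps sampled every 15 seconds:
ts = [1000, 1015, 1030, 1045, 1060]
print(delta(ts))         # [1000, 15, 15, 15, 15]
print(double_delta(ts))  # [1000, -985, 0, 0, 0] -- mostly zeros
```

Monotonic counters and steady gauges have the same shape, which is why these codecs pay off so well on metrics tables.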
Are people using ClickHouse as their timeseries backend? IIRC, Clickhouse doesn't perform all that well with millions of tiny inserts.
Last time I checked I have a few hundred billion rows in the table with a significant compression ratio (not sure off hand). Most importantly, the table is ordered efficiently enough to allow me to query years of metrics (Grafana plugin) at millisecond speed.
Side note: I recall ClickHouse developers mentioning they are currently working on an implementation change which will make many tiny inserts much more performant and realistic to use in the real world.
Hope this helps!
In my case the data arrived in CSVs with around 20k skus. Had they arrived a couple at a time, I could have created a CSV and written to ClickHouse later or used any of the other storage methods available in ClickHouse.
ClickHouse is perfectly optimized for storing and querying such time series, including metrics. It's true that ClickHouse isn't optimized for handling millions of tiny inserts per second; it prefers infrequent batches with a big number of rows per batch. But this isn't a real problem in practice, because:
1) ClickHouse provides Buffer table engine for frequent inserts.
2) It is easy to create a special proxy app or library for data buffering before sending it to ClickHouse.
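Option 2 can be as simple as an in-process buffer that accumulates rows and flushes them as one batch once a size (or age) threshold is reached. A minimal sketch (`flush_fn` here is a stand-in for whatever batch-insert call your ClickHouse client library provides, not a real API):

```python
import time

class InsertBuffer:
    """Accumulate rows in memory and hand them to ClickHouse as
    infrequent large batches instead of many tiny inserts.
    `flush_fn` is illustrative: in practice it would issue one
    batch INSERT via your ClickHouse driver of choice."""

    def __init__(self, flush_fn, max_rows=10_000, max_age_seconds=1.0):
        self.flush_fn = flush_fn
        self.max_rows = max_rows
        self.max_age = max_age_seconds
        self.rows = []
        self.first_row_at = None

    def add(self, row):
        if not self.rows:
            self.first_row_at = time.monotonic()
        self.rows.append(row)
        if (len(self.rows) >= self.max_rows
                or time.monotonic() - self.first_row_at >= self.max_age):
            self.flush()

    def flush(self):
        if self.rows:
            self.flush_fn(self.rows)  # one batch insert per flush
            self.rows = []

batches = []
buf = InsertBuffer(batches.append, max_rows=3)
for i in range(7):
    buf.add((i, i * 0.5))
buf.flush()  # flush the final partial batch
print([len(b) for b in batches])  # [3, 3, 1]
```

Seven single-row writes arrive at the "database" as just three batches; the Buffer table engine does essentially this server-side.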
TimescaleDB provides Promscale - a service which allows using TimescaleDB as a storage backend for Prometheus. Unfortunately, it doesn't show outstanding performance compared to Prometheus itself or to other remote storage solutions for Prometheus. Promscale requires more disk space, disk IO, CPU and RAM according to production tests.
Full disclosure: I'm CTO at VictoriaMetrics - competing solution for TimescaleDB. VictoriaMetrics is built on top of architecture ideas from ClickHouse.
This article has Clickhouse more-or-less spanking TimescaleDB, but the blog post it references is basically the reverse.
Are the use cases just that different?
The only use case where TimescaleDB is more useful is the ability to mutate/delete single rows, but even there, Clickhouse offers some workarounds at the expense of a little extra storage until a compaction is run, similar to VACUUM.
Clickhouse is to TimescaleDB what Nginx was to Apache.
Same. I'm ready to believe my experience is not representative, but I've rarely heard something different after talking to people who've seriously evaluated both.
> Clickhouse is to TimescaleDB what Nginx was to Apache.
Perfect comparison. Except I don't remember Apache cooking some tests to pretend they are faster than nginx, or astroturfing communities :)
Different tools serve different purposes, simple as that.
If TimescaleDB or Apache does the job for you, stick with them.
When you want to scale, increase performance, or just rewrite, choose the better option of the day.
In 2021, Clickhouse should be a recommended default, like nginx.
I would just encourage all vendors to be more humble in positioning their benchmarks. In my experience, production behavior, for better or worse, rarely resembles benchmark results.
> Overall, although some TimescaleDB queries became faster by enabling compression but many others became bit slower probably due to decompression overhead. This may be the reason why TimescaleDB disable compression by default
This matches my experience: ClickHouse is generally faster, and a better solution for time series (more robust, more mature, ...), unless you have a highly specific set of constraints (ex: must be able to delete individual records, ...) and sacrificing performance for them is an acceptable tradeoff.
I have no doubt that, as usual, akulkarni will make a good PR job / community outreach to explain why, numbers and experience be damned, TimescaleDB is better!
But I suggest interested readers check the history of previous "creative engineering" around tests that has been done to make TimescaleDB come out ahead: https://news.ycombinator.com/item?id=28945903
In 99% of the case, ClickHouse is the right choice, especially if you care about the license not adding too many restrictions.
> I have no doubt that, as usual, akulkarni will make a good PR job / community outreach to explain why, numbers and experience be damned, TimescaleDB is better!
But since this is a public forum, I'll answer your comment:
In general: ClickHouse is better than TimescaleDB for OLAP. TimescaleDB is better for time-series. If you don't believe me, that's fine! Each workload is different and you should test it yourself.
p.s. Let's keep HackerNews a more positive place. Negative comments are unnecessary, not productive, and honestly just make the author look immature.
Here, I provided a link to the previous discussion, because personally, I do not appreciate being misled. I encourage you to check the technical details there if you don't believe me.
But maybe not being 100% positive and supportive is no longer acceptable in 2021? Or maybe it's the complexity of the issues discussed?
So let's give a simpler message, as rkwasny said it best just yesterday: "It's really quite easy, if you don't need DELETE ClickHouse wins every benchmark" https://news.ycombinator.com/threads?id=rkwasny
It's simple as that: if you need deletion, consider TimescaleDB.
For every other conceivable scenario, ClickHouse is likely to come out ahead, unless you are doing something very, very wrong with it: a virtualization example would be splitting cores across VMs with no respect for their shared cache.
When people talk about doing millions of tiny inserts, it's a bit like that: a misconfiguration. And that's not how it works in the real world: even with plain Postgres, you often use a middle layer to avoid resource issues (increasing max_connections has a cost; that's why pgpool exists!), either directly in your app, or by putting some kind of buffer in front of the real table.
ClickHouse has such features, to automatically handle the flushing to the real table: https://clickhouse.com/docs/en/engines/table-engines/special...
I have spent some serious time with both, think of me what you may, but the CEO of TimescaleDB saying TimescaleDB performance can withstand the comparison with ClickHouse is like Intel's marketing department saying Intel CPUs can withstand the comparison with AMD: unless you cook the tests with some highly specific workloads (say with lots of SIMD/AVX-512 stuff, single-core...) to be non-representative of the most common scenarios, you're not being honest.
I believe such thinly veiled dishonesty is a much larger problem than any perceived negativity.
Compression is one of the many closed-source/proprietary features in Timescale. Timescale is a great idea, as it's just a Postgres extension, so there's no need to add another database, but with such an important feature being proprietary, I end up looking at the fully open-source ClickHouse and I see the operational overhead of another DB as a reasonable trade-off for keeping my stack open source and avoiding vendor lock-in.
Here you can see a chart outlining which features are proprietary and which are open source in the open-core TimescaleDB: https://docs.timescale.com/timescaledb/latest/timescaledb-ed...
I suppose we can agree to disagree on this. Perhaps "Open Core, Source Available" is a term you can agree to? I think my original comment was clear that part of Timescale is Open Source, or in other words Open Core.
>> the only thing you cannot do when using the Timescale license is basically pull an AWS move and sell TimescaleDB as a service
Actually, by virtue of this, it prevents me from paying someone else to host a Timescale fork for me... this in turn is a major stumbling block to creating a viable fork if my business interests diverge from Timescale's business interests for any reason, thus leading to vendor lock-in, as per my original comment.
Does anyone have a TimeScaleDB implementation that they love for time-series workloads that they are so happy with that they don't miss the non-timescale benefits of ClickHouse?
I think this series of posts confirms the first law of benchmarketing: for any system, one can come up with an "unbiased" benchmark which confirms its superiority.
Again, though...great writeup!
The TimescaleDB schema in this benchmark uses NUMERIC (variable size, exact precision) versus Float32 or Float64 in the Clickhouse schema.
It would be interesting to see the results with the TimescaleDB schema updated to fairer REAL (Timescale/Postgres's float32 equivalent) and DOUBLE PRECISION (float64) columns, as per Clickhouse's schema.
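As a rough illustration of why that column choice matters (plain Python, binary widths only): float32/float64 values are a fixed 4 and 8 bytes, while exact-precision NUMERIC has no fixed width and generally costs more per value to store and scan:

```python
import struct
from decimal import Decimal

# Fixed-width binary floats, as in ClickHouse Float32/Float64
# or Postgres REAL / DOUBLE PRECISION:
print(struct.calcsize("<f"))  # 4 bytes per float32 value
print(struct.calcsize("<d"))  # 8 bytes per float64 value

# Exact-precision decimals have no fixed binary width; even a
# plain text rendering is already larger than a float32:
print(len(str(Decimal("23.456789"))))  # 9 bytes
```

Across billions of rows, that per-value difference (plus the cheaper arithmetic on native floats) can easily dominate a benchmark.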
There is a Clickhouse FDW for PostgreSQL which in some cases can provide great speed with full join support.
A good example is what Uber is doing.
Or Altinity's great materialized view tutorials.
Clickhouse is unmatched with these workflows.