Whenever I see these posts from TimescaleDB, I always want to ask them how it compares in performance to alternative extensions that implement the same features, rather than just comparing TimescaleDB to vanilla PostgreSQL.
For example, they mention their automated data retention and how it's achieved with one SQL command, and how DELETEing records is a very costly operation, and how "even if you were using Postgres declarative partitioning you’d still need to automate the process yourself, wasting precious developer time, adding additional requirements, and implementing bespoke code that needs to be supported moving forward".
There's zero mention anywhere of pg_partman, which does all of these things for you equally as simply, and is a fully OSS free alternative [0].
I get that it's a PG extension that competes with their product. I know that TimescaleDB does a few other things that pg_partman does not. But I can't help but find its (seemingly) purposeful omission from these otherwise very thorough blog posts misleading.
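To make the comparison concrete, here's roughly what the "one SQL command" looks like on each side (a sketch; the table and column names are invented, and the pg_partman calls assume the 4.x API):

```sql
-- TimescaleDB: retention is a single policy call.
SELECT create_hypertable('events', 'created_at');
SELECT add_retention_policy('events', INTERVAL '30 days');

-- pg_partman: the parent table must be created with
-- PARTITION BY RANGE (created_at) first, then:
SELECT partman.create_parent('public.events', 'created_at', 'native', 'daily');
UPDATE partman.part_config
   SET retention = '30 days', retention_keep_table = false
 WHERE parent_table = 'public.events';
-- Old partitions are then dropped by pg_partman's background
-- worker, or by scheduling: SELECT partman.run_maintenance();
```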
Seconded. I understand that Timescale extends Postgres, but I'd still like to see performance comparisons relative to other time-series databases such as InfluxDB or even TiDB.
Shameless self-plug: we're also building database technology, but for embedding vectors (https://milvus.io) rather than time-series data - when doing query performance comparisons, we get several orders of magnitude of performance improvement over traditional databases. It's an unfair comparison, so we generally avoid it.
Same. I've been experimenting with the recently fully open sourced Citus 11 and it's pretty incredible. To the point that I have a hard time imagining NOT using it going forward.
Maybe I need to do my own comparison at some point.
We find that most developers storing time-series data on Postgres are doing so without pg_partman. So we first wanted to provide a benchmark that would be useful to most developers.
This benchmark was also the result of months of dedicated work. So the team did spend a lot of time on this. Unfortunately, they ran out of time to cover pg_partman. But that comparison is on our list of things to do soon.
Most people storing time-series data are also doing so outside of TimescaleDB.
So the relevant question isn't what a typical PostgreSQL user is doing, but whether someone wanting to optimize their storage should look at a PostgreSQL extension, or an upstart database like TimescaleDB.
Though I didn't write this post, I'd imagine at least part of it is that it's already nearly 4000 words and a 15 minute read and we just didn't want to add another set of things to it, to be perfectly honest.
`pg_partman` is cool! I haven't used it in a while, and because it uses declarative partitioning it has some locking issues that our partitioning scheme addresses. But implying that it is OSS and we're not when it comes to things like data retention is a bit misleading as well. The `drop_chunks` command used for data retention is in the Apache 2 licensed portion of Timescale.
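For reference, that retention path is a one-liner (a sketch, assuming the TimescaleDB 2.x signature and a hypothetical `conditions` hypertable):

```sql
-- Drop all chunks holding data older than 30 days
-- (this lives in the Apache 2 licensed portion).
SELECT drop_chunks('conditions', INTERVAL '30 days');
```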
But almost all of your posts and benchmarks are based on the closed-source version of Timescale. Everywhere I have seen it, the recommendation is always to use the closed-source version to get decent performance out of it.
Just to clarify: Nothing in Timescale is closed source. It is all source available, all on GitHub. Some of it is Apache 2 licensed, some of it is Timescale Licensed. And it is all free.
Since the Timescale License is not an open source license, it is a closed source license. You are right, it is source available, but source available is also closed source. It is closed because it is not open. And it might be free as in beer, but it is not free as in freedom.
My recollection is that the TS license simply has protection against using the TS code to compete with TS, à la Amazon RDS.
While some people on HN feel that this is an impurity they can’t live with, I personally think it’s a small price to pay to enable development of TS to continue. In my opinion, claiming that it’s closed source is somewhat dogmatic. Many open source licenses have some kind of restrictions on use; the GPL comes to mind.
I have always been more on the pragmatic side of the FOSS movement (open source versus the philosophical/moral stance of the FSF, though I still have tremendous respect for the FSF). For pragmatic reasons I reject the Timescale License. It doesn't just block Amazon (that alone would be misguided enough) but prevents anyone other than Timescale from hosting the software for me. That means even if Timescale's current offerings lined up perfectly with my business, I would be locked into whatever decisions they make in the future, which may not align.

The chances of a successful community fork are greatly reduced under the restrictions of the Timescale License. It makes it impossible for the community to make contributions on equal footing with Timescale, so anyone contributing is just doing free work for a corporation rather than contributing to a product the community benefits from as much as the corporation does.
I prefer the Apache or Mozilla Public Licenses myself. But I accept the AGPL and the GPL family of licenses, and will consider using software licensed under them in limited ways (it isn't clear how the AGPL interacts with infrastructure software).
"Open source" and "closed source" are not the only options. There are plenty of products out there where you're technically allowed to look at the source code, but very restricted in how you can legally use it. The "open" in "open source" is generally understood to mean that users have permission to use, modify and redistribute the software. (Without that permission, calling it "freeware", "shared source" or "source available" would be more accurate.)
In the case of the non-Apache-licensed version of TimescaleDB, you're allowed to use the software without payment, and you can distribute unmodified copies. But you're essentially forbidden from letting users define their own schemas, or from modifying it or reusing components unless your modified version imposes that same restriction. (The exception is if you agree to transfer ownership of your changes back to Timescale.)
Nobody's saying that Timescale can't build a non-open-source database, only that they should be clear about which parts are actually open. In my opinion, describing it on the homepage as an "open-source relational database" and then promoting it by benchmarking the proprietary version is at least a little bit misleading.
Sometimes when benchmarks comparing your own product (A) with a competitor's product (B) are posted on HN, someone working for the competitor refutes the claims made in the blog and states that B was not given the same requirements as A. This particular thread comes to mind [1]. Maybe this is due to the blog writers misunderstanding B, or maybe the competitor misunderstands A, or maybe there is a better way to do what was benchmarked using B. I think a better way would be to use some standard benchmark, and to publish the code used for benchmarking.
Easy fix - we can do a benchmark comparing TimescaleDB to pg_partman.
Longer reply: pg_partman does address many of the same developer experience items we do, but it doesn't offer things like compression or modified plans. It will perform roughly the same as declarative partitioning (because that's what it uses), and I'm guessing we will see results similar to the last large table (in the Declarative Partitioning section).
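For anyone who hasn't set this up by hand, this is the baseline that both pg_partman and our chunking automate (a minimal sketch with invented names):

```sql
CREATE TABLE events (
    ts      timestamptz NOT NULL,
    payload jsonb
) PARTITION BY RANGE (ts);

CREATE TABLE events_2022_05 PARTITION OF events
    FOR VALUES FROM ('2022-05-01') TO ('2022-06-01');
CREATE TABLE events_2022_06 PARTITION OF events
    FOR VALUES FROM ('2022-06-01') TO ('2022-07-01');

-- "Retention" means dropping the oldest partition. DROP takes an
-- ACCESS EXCLUSIVE lock, which is the kind of locking issue noted
-- above (PG 14+ can ALTER TABLE ... DETACH PARTITION ... CONCURRENTLY).
DROP TABLE events_2022_05;
```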
> For example, one recent query planner improvement excludes data more efficiently for relative now()-based queries (e.g., `WHERE time >= now() - '1 week'::interval`).
This could be done by just calculating the start date in code too.
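Something along these lines (a sketch; the table name is invented):

```sql
-- now() is STABLE, so plain Postgres can't fold it into a constant
-- at plan time and must rely on runtime partition pruning:
SELECT count(*) FROM conditions
WHERE time >= now() - INTERVAL '1 week';

-- Computing the bound in the application and sending a literal
-- lets partitions be excluded at plan time:
SELECT count(*) FROM conditions
WHERE time >= '2022-06-01 00:00:00+00';
```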
> When hypertables are compressed the amount of data that queries need to read is reduced, leading to dramatic increases in performance of 1000x or more.
At my workplace we recently experimented with storing time-series data in an array in a Postgres row. This gets compressed via TOAST and can store thousands of ints in just a few DB pages (i.e., loading it costs about the same as an index scan). We also use Timescale for a different service, mind you. I'm sure this format is more efficient than the Timescale format too. In Timescale you would need rows containing (for example) (date, user_id, time_on_site), one row per day. The Postgres array format (start_date, user_id, time_on_site_by_date), indexing where 0 = start_date, 1 = start_date + 1, is about 1/3rd the size uncompressed. And yeah, even if something is compressed, you still have to put the uncompressed version in memory somewhere.
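Roughly this shape, if anyone's curious (a sketch; the names and key choice are mine):

```sql
CREATE TABLE user_time_on_site (
    user_id      bigint PRIMARY KEY,
    start_date   date   NOT NULL,
    -- slot 1 = start_date, slot 2 = start_date + 1 day, ...
    -- (Postgres arrays are 1-based; the scheme above used 0-based)
    time_on_site int[]  NOT NULL
);

-- A large int[] is TOASTed out of line and compressed, so the whole
-- series costs only a few pages to read. Fetching one day's value:
SELECT time_on_site[('2022-06-15'::date - start_date) + 1]
  FROM user_time_on_site
 WHERE user_id = 42;
```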
We are kindred spirits I think! I did this too [0] a while back at a previous company and it actually served as part of the inspiration for our compression work! It's fun, but a bit difficult to query at times. Our compressed columns do also get TOASTed and stored out of line.
I'm not sure it's going to be much more efficient than the Timescale format once it's compressed; we have some pretty good compression algorithms, though I might be missing something about your case. We can generally achieve close to 10x compression, but right now you can't write directly to compressed data, so you would save on the write side, I suppose.
It is true that you need to put the uncompressed version into memory at some point, but we do try to limit that, and in many cases you end up IO-limited more so than memory-limited. We're also thinking about doing some work to push processing down towards the compressed data; that's still at the "glint in our eye" stage, but I think it has a lot of promise.
(As a side note, TOAST is still the best acronym around ;) ).
It should be noted that the article is titled Timescale + Postgres vs. Postgres alone. Timescale is built on Postgres, so it's not really a competition per se.
This is a reasonable benchmark from the ClickHouse folks for single-table analytical query performance over smallish data sets (10s of GB of data). Most of the DW vendors are on there.
As with anything, it depends on what you want to do.
If you have an OLAP heavy workload with long scans, etc (which is the type of queries prominent on the ClickHouse page - e.g., Q0 is "SELECT COUNT(*) FROM hits;"), then I would highly recommend systems other than Timescale. (Although we are also working on this ;-) )
But if you have a time-series workload, or even if you love Postgres and are building a time-series and/or analytical application, then I would recommend TimescaleDB.
ClickHouse is great. I just believe in using the right tool for the right job. :-) There are many areas where column store engines beat TimescaleDB. But nothing comes for free - everything has a tradeoff.
As a sidenote, Databend's elevator pitch sounds quite good: from what I've seen there's quite a lot of complexity in managing co-located storage and compute like ClickHouse, Druid and Pinot. Leveraging object storage instead seems like a good alternative.
Hey, about the column-wide data compression thing: if I have a column with strings that are highly similar to each other (maybe they're URLs, maybe they're HTML documents scraped from the same site, but they're expected to contain redundancies), what's the best approach to get excellent compression? Can I exploit this redundancy within the column somehow? I don't even know the right search terms for this.
TimescaleDB compression will work only for integer columns, right? And in any case, this is not a time series.
TimescaleDB supports compression for all data types, it just uses type-aware compression algorithms that it applies automatically/transparently based on typing.
- Gorilla compression for floats
- Delta-of-delta + Simple-8b with run-length encoding compression for timestamps and other integer-like types
- Whole-row dictionary compression for columns with a few repeating values (+ LZ compression on top)
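You enable it per hypertable and the appropriate algorithm is applied per column type (a sketch, assuming the 2.x policy API and a hypothetical `conditions` hypertable with a `device_id` column):

```sql
-- Turn on compression, segmenting by device and ordering by time
ALTER TABLE conditions SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_id',
    timescaledb.compress_orderby   = 'time DESC'
);
-- Compress chunks once they are a week old
SELECT add_compression_policy('conditions', INTERVAL '7 days');
```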
Timescale doesn't charge for any of its software. Revenue comes from providing hosting services that are optimized towards TimescaleDB and PostgreSQL at scale.
If integration with PostgreSQL is not a requirement, I would suggest looking at QuestDB[1] for the best performing open-source timeseries database currently on the market.
Are you sure it is the best performing? The article only compares to Influx and Timescale, which are very far from the fastest time-series databases out there. How does it compare to ClickHouse? I found this https://questdb.io/blog/2022/05/26/query-benchmark-questdb-v..., but the numbers are so far off (and ClickHouse is much slower than Timescale in their examples) that I kind of struggle to believe them.
It may be faster for specific benchmark cases, but in my experience Timescale has had wildly unpredictable performance when used in a production setting. It's also very annoying that you have to remember to run certain Timescale functions for backups or restores, and there are many more additional background processes running.
[0] https://github.com/pgpartman/pg_partman/blob/master/doc/pg_p...