Hacker News
Building a distributed time-series database on PostgreSQL (timescale.com)
383 points by craigkerstiens on Aug 21, 2019 | 97 comments

The biggest limit is that their "chunking" of data by time slices may lead directly to the hot-partition problem -- in their case, a "hot chunk." Most time-series data is 'dull time' -- uninteresting time samples of normal activity.

Then, out of nowhere, some 'interesting' stuff happens. It'll all be in that one chunk, which will get hammered during reads.

Like, imagine all the telemetry data and video taken during a single moon landing. Most of the data in the time series is from the days in transit; 99% of it will be "uninteresting." But the moment Neil Armstrong sets foot on the lunar surface, and the moments leading up to and following that event, will be the "hot chunk."

Advice: Take Zipfian distributions into account for data access.

(Disclosure: I work at ScyllaDB, which scales horizontally and vertically, and which underlies various open-source time-series databases like KairosDB and OpenNMS' Newts. Not trying to knock them, but hopefully to save them from worlds of hurt found out the hard way.)
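To make the Zipfian point concrete, here's a toy sketch (pure Python; the chunk count and skew parameter are made up for illustration): sampling chunk accesses from a truncated Zipf distribution shows how a handful of chunks absorb most of the reads.

```python
import random
from collections import Counter

N_CHUNKS, S = 100, 1.1  # hypothetical chunk count and Zipf skew
WEIGHTS = [1.0 / rank ** S for rank in range(1, N_CHUNKS + 1)]
TOTAL = sum(WEIGHTS)

def zipf_sample() -> int:
    """Sample a chunk index from a truncated Zipf(S) distribution."""
    r = random.random() * TOTAL
    for idx, w in enumerate(WEIGHTS):
        r -= w
        if r <= 0:
            return idx
    return N_CHUNKS - 1

random.seed(42)
reads = Counter(zipf_sample() for _ in range(100_000))
hot_share = reads.most_common(1)[0][1] / 100_000
top10_share = sum(c for _, c in reads.most_common(10)) / 100_000
print(f"hottest chunk: {hot_share:.0%} of reads; top 10 chunks: {top10_share:.0%}")
```

With these (invented) parameters, one chunk out of 100 ends up serving roughly a fifth of all reads, and the top 10 chunks serve more than half -- which is exactly the access pattern a uniform chunk layout doesn't plan for.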

Blog post co-author and Timescale engineer here.

Thanks for the advice. FWIW, though, TimescaleDB supports multi-dimensional partitioning, so a specific "hot" time interval is actually typically split across many chunks, and thus server instances. We are also working on native chunk replication, which allows serving copies of the same chunk out of different server instances.

Apart from these mitigations of the hot-partition problem, it's usually better to be able to serve the same data to many requests from a warm cache than to have many random reads that thrash the cache.

Hey Erik, thanks for the post. In this vision, would this cluster of servers be reserved exclusively for timeseries data, or do you imagine it containing other ordinary tables as well?

We're using postgres presently for some IoT, B2B applications, and the timeseries tables are a half dozen orders of magnitude larger than the other tables in our application. Certain database operations, like updates, take a very long time because of this. I've wondered if by splitting the timeseries tables onto their own server I could handle updates independently, with the main app gracefully handling the timeseries DB being offline for some period of time.

It's more than just about downtime, though. If, through poor querying or other issues, the timeseries DB is overloaded, the customer impact of the slowdown would be limited.

Hi @benwilson-512:

We commonly see hypertables (time-series tables) deployed alongside relational tables, often because there exists a relation between them: the relational metadata provides information about the user, sensor, server, or financial instrument that is referenced by id/name in the hypertable.

So joins between these time-series and relational tables are common, and together they serve the applications one often builds on top of the data.

Now, TimescaleDB can be installed on a PG server that is also handling tables that have nothing to do with its workload, in which case one does get performance interference between the two workloads. We generally wouldn't recommend this for most production deployments, but the decision here is always a tradeoff between resource isolation and cost.

The link to join private beta isn't accessible, can you please look into it?

Sorry about that. We made a change that temporarily took it offline. Back now.

Hi Peter, as the blog post talks about, our distributed hypertables typically partition by both time _and_ "space" (i.e., some other column like device id, etc.) as a way to better parallelize I/O (reads & writes) for the "current" time. That is, each time slice is typically spread across all nodes that existed when the time interval was opened. So this greatly ameliorates the interesting "time" problem you mention.

Now, if this time/space partitioning alone isn't sufficient (i.e., demand for a single device/userid/etc. at a specific time overwhelms the read capacity of K nodes), the fact that time-series data is primarily insert-heavy (or even immutable) also gives us a lot of flexibility in how we replicate (as a sibling comment also suggested). And what really helps is that, by design, the architecture we built tracks fine-grained chunk information (rather than just coarse-grained hash partitions), which enables dynamic replication of individual chunks. More on this to come.
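A rough sketch of what the time-plus-space partitioning buys: the node names, hash function, interval size, and partition count below are all invented for illustration (this is not TimescaleDB's actual routing), but they show how rows from the same "hot" time interval fan out across nodes by a second dimension.

```python
from datetime import datetime, timedelta
from zlib import crc32

CHUNK_INTERVAL = timedelta(days=1)   # time dimension: one chunk interval per day
NUM_SPACE_PARTITIONS = 4             # "space" dimension, e.g. hash of device id
NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster

def chunk_for(ts: datetime, device_id: str):
    """Map a row to a (time_slice, space_partition) chunk, then to a node."""
    time_slice = int(ts.timestamp() // CHUNK_INTERVAL.total_seconds())
    space_part = crc32(device_id.encode()) % NUM_SPACE_PARTITIONS
    # Each time slice is spread over the cluster via its space partitions.
    node = NODES[space_part % len(NODES)]
    return (time_slice, space_part), node

# Rows for the *same* "hot" time interval land on different nodes:
ts = datetime(2019, 8, 21, 12, 0)
placements = {chunk_for(ts, f"device-{i}")[1] for i in range(100)}
print(sorted(placements))
```

The point is that a single interesting hour is no longer one chunk on one server; reads and writes for it parallelize across however many space partitions (and thus nodes) exist.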

I was disappointed to see that Adaptive Chunking is deprecated[1]. Are there future plans to ~replace this functionality?

[1] https://docs.timescale.com/latest/api#set_adaptive_chunking

We deprecated Adaptive Chunking because we weren't thrilled with the way it was working. But yes, we are looking into an improved way of solving this problem.

It would be great if you could share with us how this feature has been working out for you and how we can improve it in the future.

Naive question: if time-series data is presumably immutable, shouldn't it be easy to just arbitrarily replicate chunks proportionate to load?

There's a trade-off here in that replicating data decreases the read load but increases the write load. If you have a chunk hot with writes, increasing the replication will make things worse, not better.

Right, but I'm assuming that with immutable data writes aren't a problem?

It depends on if by "immutable" you mean the only operation you are performing on the dataset are reads.

Writes is a catch-all term usually used to describe either updates or inserts. If you are inserting new data and a single chunk is hot because you are inserting a lot of data into it, then replicating won't help. Imagine a scenario where a single device goes haywire and starts sending you a ton of data points.

If you are only performing reads on your dataset, then replicating will only improve performance.

Yeah, but what about increasing the replication factor only for “hot” chunks?

"Hot" is lingo for describing a chunk that is being operated on at a rate much higher than other chunks. Depending on what exactly is making the chunk "hot" increasing replication can either make things better or worse.

If you have a chunk that's hot because there are a lot of reads going to it, yes, increasing replication will help because you are decreasing the amount of work you have to do per replica.

If you have a chunk that's hot because a lot of writes are going to it, increasing replication will make things worse as you are doing just as much work per replica as you were before, but you're now doing it on more replicas.

Does that make sense?
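The asymmetry the thread is circling around fits in one line of arithmetic. A toy model (all numbers invented): reads divide across replicas, but every write must be applied to every replica.

```python
def per_replica_load(reads: int, writes: int, replicas: int) -> float:
    """Ops/sec each replica handles: reads are spread out, writes are not."""
    return reads / replicas + writes

# Read-hot chunk: adding replicas helps a lot.
print(per_replica_load(reads=9000, writes=100, replicas=1))  # 9100.0
print(per_replica_load(reads=9000, writes=100, replicas=3))  # 3100.0

# Write-hot chunk: per-replica load barely moves, and total cluster
# write work (writes * replicas) triples.
print(per_replica_load(reads=100, writes=9000, replicas=1))  # 9100.0
print(per_replica_load(reads=100, writes=9000, replicas=3))
```

Same total of 9100 ops/sec in both scenarios, but replication only pays off in the read-heavy one.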

So what's the strategy for chunks that are "hot" with writes? Partitioning?

Referencing my copy of Designing Data-Intensive Applications[0], here are some approaches it mentions:

1) The naive approach is to assign writes to chunks randomly. This makes reads much more expensive, since a read for a particular key (e.g. device) now has to touch every chunk.

2) If you know a particular key is hot, you can spread writes for that key across random chunks. You need some extra bookkeeping to keep track of which keys you are doing this for.

3) Split hot chunks into smaller chunks. You will wind up with varying-sized chunks, but each chunk will now see roughly equal write volume.

One more approach I would add is rate limiting: if the reads or writes for a particular key cross some threshold, you can drop any additional operations. Of course, this is only fine if you are OK with operations on hot keys often failing.

[0] https://www.amazon.com/Designing-Data-Intensive-Applications...
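Approach 2 above (spreading a known-hot key, sometimes called key salting) can be sketched in a few lines. Everything here -- the spread factor, the `key#salt` naming, the in-memory `chunks` dict -- is a hypothetical stand-in for real storage:

```python
import random
from collections import defaultdict

SPREAD_FACTOR = 8          # hypothetical fan-out for hot keys
hot_keys = {"device-42"}   # the extra bookkeeping: which keys are spread

chunks = defaultdict(list)

def write(key: str, value) -> None:
    """Route a write; hot keys are salted across SPREAD_FACTOR sub-chunks."""
    if key in hot_keys:
        salt = random.randrange(SPREAD_FACTOR)
        chunks[f"{key}#{salt}"].append(value)
    else:
        chunks[key].append(value)

def read(key: str) -> list:
    """The cost: reads for a hot key must gather from every sub-chunk."""
    if key in hot_keys:
        return [v for s in range(SPREAD_FACTOR) for v in chunks[f"{key}#{s}"]]
    return chunks[key]

for i in range(1000):
    write("device-42", i)   # hot key: writes land on 8 different sub-chunks
write("device-7", "ok")     # cold key: single chunk, single-chunk read

print(len(read("device-42")), len(read("device-7")))  # 1000 1
```

The write hotspot is diluted by roughly the spread factor, at the price of fan-out reads and the bookkeeping set.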

I can only recommend TimescaleDB. It solves the right problems (storing time series) without creating new ones (deployment, backup, hot failover), as it relies on Postgres for the underlying infrastructure. I stored 100 million sensor samples in TimescaleDB and had no issues with scaling on medium-sized boxes, despite issuing complex time-series queries.

As for hosting options, sadly AWS doesn't currently offer TimescaleDB as part of RDS. There are two options: Azure now offers TimescaleDB as part of their hosted Postgres, or you can go with aiven.io, who can host Postgres with TimescaleDB on all major cloud providers (AWS, GCP, Azure, DO, ?) as a service, including replicas and backups.

Overall, I’m very happy to see the Postgres ecosystem growing.

If you are looking for hosted TimescaleDB, eg on AWS, you may want to check out Timescale Cloud [1], which we launched a couple months ago.

Fully-managed TimescaleDB, including community and enterprise capabilities, high-availability, etc, available on AWS, GCP, and Azure.

There are other options as well (Azure, DigitalOcean, Alibaba, Aiven) but they only offer the OSS version of TimescaleDB.

More here (including a pricing calculator): https://www.timescale.com/cloud

[1] https://blog.timescale.com/blog/timescale-cloud-first-fully-...

Thanks for the recommendation! And glad to hear TimescaleDB meets your needs. I wanted to add some color to hosting / deployment options around TimescaleDB.

Options are as follows:

[1] Fully-Managed TimescaleDB Enterprise on Timescale Cloud

TimescaleDB Open Source hosted on Public Clouds:

[2] Azure PostgreSQL

[3] DO Managed-Postgres (mentioned above)

[4] Alibaba Cloud

Not hosted, but managed service options:

[5] ClusterControl from our friends at Severalnines


[1] https://www.timescale.com/cloud

[2] https://azure.microsoft.com/en-us/blog/power-iot-and-time-se...

[3] https://www.digitalocean.com/docs/databases/postgresql/resou...

[4] https://www.alibabacloud.com/blog/sql-and-timescaledb_595169

[5] https://severalnines.com/blog/advanced-database-monitoring-m...

Interesting. My team currently uses (abuses?) Postgres for timeseries data. Would you mind answering some general questions about your experience with Timescale? You said 100 million sensor samples; what was the upload/download frequency? Our application is pushing hundreds of millions of rows across many different data sources every day. On top of that, we are also querying the shit out of this data to run models, and we need VERY quick queries - like 10-100ms speed.

How do you think timescaleDB would handle that size and also velocity of data?

TimescaleDB could fit your workload if PostgreSQL fits you. The main issue with PostgreSQL and TimescaleDB is the large amount of storage space required for huge time-series data volumes. There are reports that storing data on ZFS can reduce the required storage space.

Probably ClickHouse [1] would fit your needs better. It can write millions of rows per second [2], it can scan billions of rows per second on a single node, and it scales to multiple nodes.

Also I'd recommend taking a look at other open-source TSDBs with cluster support:

- M3DB [3]

- Cortex [4]

- VictoriaMetrics [5]

These TSDBs speak PromQL instead of SQL. PromQL is a query language specially optimized for typical time-series queries [6].

[1] https://clickhouse.yandex

[2] https://blog.cloudflare.com/http-analytics-for-6m-requests-p...

[3] https://www.m3db.io/

[4] https://github.com/cortexproject/cortex

[5] https://github.com/VictoriaMetrics/VictoriaMetrics/

[6] https://medium.com/@valyala/promql-tutorial-for-beginners-9a...

We are actively working on native compression in TimescaleDB, with really promising early results. More to come.

Actually, the last time I checked/used it, TimescaleDB did use parallel queries and had good partitioning, so that parallel queries would scale really well even on a single node.

DigitalOcean Managed Databases also comes with TimescaleDB built-in: https://www.digitalocean.com/docs/databases/postgresql/resou...

I was going to post a very similar comment :)

For on-prem deployments I'm using TimescaleDB on a single node with up to 100 million events, and for SaaS I'm using it on Azure, and the performance is little short of amazing!

Something I particularly like is the almost instantaneous deletes (because of the "chunking" model) - perfect for data retention jobs that delete old data.

I use RiakTS, although unfortunately with Basho's demise its future is in a bit of jeopardy.

We at VictoriaMetrics recognized the importance of splitting up storage and query nodes as well. We went even further and separated insert nodes from storage nodes. So for the cluster version we have 3 types of nodes:

  * vminsert (stateless)
  * vmselect (stateless)
  * vmstorage (stateful)
However, we found that the PostgreSQL storage layer takes an incredibly huge amount of space -- 28 bytes/metric versus 0.4 b/m with VictoriaMetrics (a 70x difference!) for typical real-world data. That's why we didn't consider PostgreSQL for our storage layer, which otherwise could be awesome.

(see Disk Usage benchmark graph at [1])

That hurts not only storage but also performance, as the query bottleneck becomes disk IO. Check out this benchmark we conducted with TimescaleDB v1.2.2: [2]

Good job on going multi-node in v2! Can't wait to benchmark it with VM cluster version :)

[1] https://medium.com/@valyala/measuring-vertical-scalability-f...

[2] https://medium.com/@valyala/high-cardinality-tsdb-benchmarks...

Hi @dima_vm, we've found that users have really embraced the full SQL and reliability you get from TimescaleDB's approach of leveraging PostgreSQL. But we're aware that its standard on-disk format can be more space-intensive than others (although many do deploy with ZFS to trade some CPU for I/O).

Recognizing this, the engineering team has been hard at work bringing native compression to TimescaleDB, which is also in private beta right now.

Huge wins, but more details & performance numbers in a future blog post =)

Still waiting for AWS RDS for PostgreSQL to support the TimescaleDB extension, 2 years and counting:


Probably not gonna happen as AWS wants to sell its own solution.

When though? AWS Timestream was discussed on HN 8 months ago[1]. I had registered for the preview and still no access or even a response from AWS.

[1] https://news.ycombinator.com/item?id=18553336

RDS is always behind when it comes to their extensions, especially for PostgreSQL.

I just came here to tell how happy I am with Timescaledb.

I have almost 8 tables with over 60 million rows and I'm very happy with the performance, considering I have a t2.medium instance (2 CPUs with 4GB RAM). Like everyone else mentioned, having an AWS-hosted option would be awesome!

Is there a way to optimize storage? I have set the chunk size to a 1-day interval, with about 2 million rows written per day.

My many thanks to the engineering team.

Re: optimizing storage, we are working on bringing native compression to TimescaleDB. So far the results are really promising. If you're interested in testing out an early version feel free to reach out - ajay (at) timescale.com.

Are you afraid the Postgres tuple-at-a-time iterator architecture is going to be limiting for your long term performance optimizations?

Presumably the custom operators you’ve implemented in your query plan can push projection and predicates down to a more efficient architecture without so many indirect function calls / branches / etc, but once you get up past that, aren’t you back in iterator land?

I'm guessing the use cases they see don't involve query shapes that need to scan a lot of data. A columnstore with vectorized or batched execution is a much better storage layout for fast scans with filters and aggregations (i.e., the top TPC-H and TPC-DS analytics benchmark results come from columnstore databases). A B-tree with tuple-at-a-time execution is orders of magnitude slower [1][2].

[1] http://cidrdb.org/cidr2005/papers/P19.pdf [2] http://db.csail.mit.edu/projects/cstore/vldb.pdf

They probably don't care, otherwise they wouldn't have gone with PostgreSQL in the first place.

We actually haven't been running against any limits here. One thing to keep in mind is that postgres remote-fetch operations aren't tuple-at-a-time, so this shouldn't be a bottleneck for our multi-node operations.

Have you done any analysis of your per-core scan rates for simple aggregations like sum/count + group by with a reasonably large cardinality key? Or has anyone published a benchmark you trust on queries of that variety?

An example would be TPC-H Q1, which is a little weak on the group by cardinality, but is good for testing raw aggregation performance.

We actually have done fairly extensive benchmarking of high cardinality data on our single-node product (we have a blog entry detailing at least our insert performance here: https://blog.timescale.com/blog/what-is-high-cardinality-how...)

We're currently focused on query optimization for our multi-node product, but we don't have any numbers we're ready to share yet.

The time column in your hot chunk (e.g. the current day/hour for trading) is going to get banged on by basically every query, and partitioning isn't going to help much (it will probably hurt on writes). Other TS databases will partition it out after a period (e.g., an end-of-day roll). How do you deal with this?

Also, while you can make the data columnar, SQL lacks a rich enough language to take advantage of it. Your advanced queries seem like they aren't very good at exploiting the layout, and they need to be specially written into the DB (you cannot make your own high-performance queries easily). I've never seen a decent LEAD/LAG query perform well, and they are too simplistic. I think you are fighting a losing war if you are trying to optimize SQL down to good array-based access.

A good TSDB isn't just changing the storage layer. Performance is also heavily influenced by how queries are able to be expressed.

Hi @jnordwick: I talk about this in another response and in the parent article.

Individual time intervals are also spread across the cluster. So if you are collecting data from a lot of, say, sensors, servers, or financial instruments, then reads/writes for the same time interval are then parallelized across servers.


I saw. My understanding is that you basically have a sort on device, then time: this helps some for queries on individual devices to some extent (but probably hurts when you are inserting at 500 places instead of 1, or when you have queries that span too many instruments).

My point (and the others', I think) was that you often have a very hot segment - yesterday's data is only used at night, for example - and you can have a hot device (e.g., the top 10 symbols). The partitioning doesn't help a lot there unless you can spread the time around and rejoin (Netezza used to do something similar and it wasn't very good at it). Do you ever rebalance the partitions? Getting your top 10 symbols accidentally stuck on the same partition would be painful, especially without a way to control it.

Splaying the records column-wise helps with this, but I'm not sure if you are doing that.

It's not hierarchical as you describe. You don't first partition on device, then on time. They are done simultaneously -- see this older post for an illustration/comparison: https://blog.timescale.com/blog/time-series-data-postgresql-...

So this architecture fully allows various striping or distribution options across time and space, even though the default might collocate chunks belonging to the same device on the same machine (at least since the last elasticity event).

I agree with your points. What alternatives do you suggest?

Timescale has improved greatly since first released and is pretty solid on a single-node.

Wish they would tone down the hype in the blog posts, though; shards, chunks, and partitions are all the same thing. How you define the splits is completely arbitrary, and every database uses its own algorithm, including multiple levels.

Hey @manigandham, thanks for the compliments on the database overall =)

I understand conceptually that this is all about splitting data, but I think if you look at most scalable databases that use sharding, it's really meant as a partitioning of the primary keyspace over servers; you then globally map this sharding through client libraries, some transparent proxy, or some map that every node maintains, because O(map) = O(# servers). Examples: Cassandra, DynamoDB, scale-out memcached, Vitess, ZippyDB/RocksDB, etc.

We are instead tracking per-chunk state in catalogs to give us this level of flexibility, and allowing the movement/migration of individual chunks on a much finer-grained basis. This is both for placement/management across the cluster but also for data management on single nodes, e.g., for data retention policies, tiering, lazy indexing, etc.

I realize this isn’t a hard-and-fast rule, and exceptions always exist. But one reason we try to call this out is we’re often asked why we don’t just use a standard hash-based partitioning tool/system as a black box, which wouldn’t give us this level of fine-grained visibility & control that we find highly useful for time-series data management.

[Timescale co-founder & post co-author]
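The hash-partitioning vs. per-chunk-catalog distinction can be sketched in a few lines. The server names, chunk keys, and CRC-based placement below are invented for illustration; the point is only that an explicit catalog lets you relocate one chunk without touching anything else, where a pure hash function cannot.

```python
from zlib import crc32

SERVERS = ["s1", "s2", "s3"]  # hypothetical cluster

# Coarse hash partitioning: placement is a fixed function of the key, so
# you can't relocate one hot chunk without remapping its whole partition.
def hash_placement(chunk_key: str) -> str:
    return SERVERS[crc32(chunk_key.encode()) % len(SERVERS)]

# Per-chunk catalog: an explicit map, so any single chunk can be moved
# (or replicated) independently, at the cost of O(#chunks) metadata.
catalog = {("2019-08-21", p): hash_placement(f"2019-08-21/{p}") for p in range(4)}

hot = ("2019-08-21", 2)
catalog[hot] = "s3"  # migrate just the hot chunk; every other entry is untouched
print(catalog[hot])
```

The catalog costs metadata proportional to the number of chunks rather than the number of servers, which is the trade-off the comment above describes.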

Hey Michael,

I get it, you guys are using the primary keys for data => chunk and a second level for chunk => server/node. Other databases do this as well to abstract physical and logical partition placement.

Anyways, nice to see the SQL interface and AN/DN role implementation. Easier and more usable overall compared to some other solutions like Citus.

Wow. For me personally, that sounds like Distributed Event Sourcing Storage at scale.

I don't know if anybody else noticed, but the article is really intuitive; it covered almost everything. Often when such articles are published, I have to google deeper to get a sense of the practical use.

I have one query: How efficient is the deletion (from disk) of chunks in a new distributed model?

It basically boils down to deleting a bunch of files on disk. The fact that it is distributed doesn't affect efficiency too much; it is basically a delete sent to all nodes, followed by a two-phase commit.

The upside of deleting entire tables (chunks) like this is that you don't pay the same PostgreSQL vacuuming cost normally associated with row-by-row deletes.
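The retention pattern being described -- drop whole chunks instead of deleting rows -- can be sketched with an in-memory stand-in (the day-sized intervals and row counts are made up; in the real system each chunk is a table backed by files on disk):

```python
from datetime import datetime, timedelta

# Chunks keyed by the day they cover. Dropping a chunk is cheap metadata
# work, unlike a row-by-row DELETE followed by VACUUM.
chunks = {datetime(2019, 8, d): [("row",)] * 1000 for d in range(1, 22)}

def drop_chunks(older_than: datetime) -> None:
    """Retention policy: discard every chunk whose interval ends before the cutoff."""
    for start in [s for s in chunks if s < older_than]:
        del chunks[start]

drop_chunks(datetime(2019, 8, 15))
print(sorted(chunks)[0])  # 2019-08-15 00:00:00
```

Note the policy never inspects individual rows: retention cost scales with the number of chunks, not the number of rows retired.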


Really nice that this is out, I've been following the github issues related to this for a long time and I think it might be time to start that meshed sensor network I was thinking of building with timescale

Glad to hear that :) Please let us know how we can help. We also have an active Slack community [1] if you want to chat with others who are storing sensor data in TimescaleDB.

[1] https://slack.timescale.com

TimescaleDB looks really very promising but this is a red flag:

"Hypertables support all standard PostgreSQL constraint types, with the exception of foreign key constraints on other tables that reference values in a hypertable"[1]

Naively, I'd assume this could cause a two-colouring of your schema - the partition that can use referential integrity, and another with hypertables that can't - which feels like a pretty big trade-off.

[1] https://docs.timescale.com/latest/using-timescaledb/schema-m...

In practice this doesn't come up a lot. Say you have a hypertable measurement(time, device_id, value) and a device table (device_id, device_manufacturer, device_type). Timescale fully supports a foreign key from the measurement table into the devices table; this is common usage. An FK from another table referencing a measurement row is not supported, but that is also uncommon. To see why, note that part of the primary key of the measurement table is time, so conceptually the only type of table that would want an FK into it is also a time-based table, and then the only real usage is a 1-to-1 relation. That too is uncommon, and can be gotten around with normalization.
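The supported direction can be shown with plain SQL. This uses Python's built-in sqlite3 as a stand-in (not TimescaleDB, and the table/column names are just the example above): the measurement table carries the foreign key into the device metadata table.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # sqlite needs FK enforcement enabled
db.execute("CREATE TABLE device (device_id INTEGER PRIMARY KEY, device_type TEXT)")
db.execute("""CREATE TABLE measurement (
    time TEXT NOT NULL,
    device_id INTEGER NOT NULL REFERENCES device(device_id),
    value REAL)""")

db.execute("INSERT INTO device VALUES (1, 'thermometer')")
db.execute("INSERT INTO measurement VALUES ('2019-08-21T00:00:00Z', 1, 21.5)")

# Referential integrity still works in the supported direction:
# a measurement pointing at an unknown device is rejected.
try:
    db.execute("INSERT INTO measurement VALUES ('2019-08-21T00:01:00Z', 99, 0.0)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The unsupported case would be a third table declaring `REFERENCES measurement(...)`, which, as noted, rarely makes sense for a time-keyed row.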

> Building a distributed time-series database on PostgreSQL

Next order of business: Making mud pies.

PostgreSQL is geared towards transactional work. With time series, you basically just append data occasionally and do analytics. PostgreSQL is terrible for analytics - its architecture is all wrong: 2 or 3 orders of magnitude slower than the state of the art, if not more.

Which databases are good for analytics from your point of view?

In my experience, being able to do advanced ad-hoc SQL queries is priceless for analytics. Timescale helps in scaling time series use-cases that used to scale badly in plain PostgreSQL.

There are other relational databases like MemSQL or Clickhouse that use distributed column-oriented architectures that are much better at large scale analytics and aggregations.

Postgres is getting pluggable storage engines in the next version (and already has foreign data wrappers) so that can at least lead to a better storage design.

It's not just my point of view - it's well known in the research community, and has been for decades.

For FOSS, have a look at MonetDB. For research-oriented systems, look for publications regarding HyPer or VectorWise/Actian Vector (VectorH in the cluster version). Other commercial offerings are Vertica (formerly C-Store) and SAP HANA.

PostgreSQL is not even something anyone compares against in analytics...

Oh yea, MemSQL and ClickHouse are also indeed relevant and in this category, except that ClickHouse doesn't support all of SQL and any table structure, so it's not a full-fledged DBMS.

If you take a look at any of our benchmarks, you’ll see that this is not the case. PostgreSQL in fact can scale quite well for time-series analytics, if architected correctly.

But why don’t you just try out TimescaleDB and see for yourself?

Please link to those benchmarks, and we'll see. Also, a link to the relevant SIGMOD/VLDB/ICDE/DaMoN/ADMS/etc. submission arguing in favor of TimeScaleDB's design would also be appreciated.

On the linked-to article I only see references to irrelevant transactional DBMSes...

Yes, it's just like I thought. You're comparing against transaction-oriented DBMSes, or ones which handle documents rather than tabular data (and hence slow on tabular data).

One possible exception is InfluxDB - I'm not familiar enough with it.

Anyway, try running TSBS on columnar DBMSes like Actian VectorH, Vertica, SAP HANA, etc. ClickHouse may also be relevant; it doesn't support every possible schema, but it may be enough to run TSBS.

We're happy to take pull requests for new databases; we've received them so far for ClickHouse, CrateDB, and SiriDB (and one is pending). We've tried to make it relatively easy for new databases to hook in.

We usually implement ones that we hear about a lot from customers, and so far those haven't come up a ton. We'll keep it in mind though as we look to keep adding new ones.

At the end of 2018 Altinity benchmarked ClickHouse against the TSBS and documented it.


Thank you very much for your valuable input - it is very important that people understand the differences and look into this!

Performance comparisons to the candidates you named would be very interesting to see.

Downvoters: you should be happy that people with more knowledge than the average javascript-aws-webdevops guy needed to operate a startup invest time to inform you about alternatives you might not know about.

It is also important to keep this site attractive to people who have different opinions and experiences - do not do that Trump thing! Thanks!

Of course, for each claim replicable facts are needed.

I am curious about that too. As a separate topic: if operational DBs could be made compatible with Parquet-type storage (backup and restore), offline analytics and machine learning would be seamlessly integrated. Offline analytics can usually simplify online analytics: discovering new dimensions, normalization/denormalization, and optimization of indices and partitions. Operational DBs shouldn't have to stress themselves at gunpoint.

I think the ask was for comparisons against traditional analytics databases (Redshift, Vertica, etc.). Columnstores are substantially faster for table scans + aggregations than rowstores (and they use a lot less storage) [1].

[1] http://db.csail.mit.edu/projects/cstore/vldb.pdf

Anyone have examples of using this for financial / trading algorithm based needs? I've been investigating solutions for a while now and haven't had much luck on a winner.

Is it better to do aggregations with the DB or through some MapReduce method (Google Dataflow?) and write that to a DB?

I believe we have some in our Slack channel [0].

Also one of our investors is Two Sigma so this is an area of interest to us.

If you're open to it, I'd love to learn more about your specific use case. Want to chat sometime? ajay (at) timescale.com

[0] https://slack.timescale.com

Does anyone have experience how this compares with citusdb (also postgres)?

We talk about sharding vs. chunking in the blog post, and I would put CitusDB in the former category. More specifically, TimescaleDB focuses on time-series workloads. To handle time-series workloads, CitusDB suggests combining their extension with a third-party extension, pg_partman (see their docs).

I have no experience with this combination myself, so don't want to speculate about performance, etc., but when reading the docs it really seems like an afterthought.

What are some use cases for a time-series database?

The big use cases are devops/monitoring data and IoT data. But it also applies to pretty much any use case where you want to answer questions about data in relation to time.

[Timescaledb engineer here] We like to say that time-series data is any data that is insert-mostly with data associated with the most recent time period. That's a pretty broad definition, intentionally so. We see usage in telecoms, heavy industry, science, health, IoT, etc. It's really about recording the history of your data as it evolves, instead of just the current state.

I've been eyeing TimescaleDB at a distance for some time now. I'm curious if you have seen it used in finance as an alternative to KDB+ installations anywhere?

Have you thought to release any benchmarks against KDB+?

To my understanding, KDB+'s license explicitly forbids benchmarking:

"1.3 Kdb+ On Demand Software Performance. End User shall not distribute or otherwise make available to any third party any report regarding the performance of the Kdb+ On Demand Software, Kdb+ On Demand Software benchmarks or any information from such a report unless End User receives the express, prior written consent of Kx to disseminate such report or information."

That's a shame. I'm not sure what the rationale is for such clauses - especially where performance of your tech is known to be pretty good (as is the case for KDB+).

There's definitely a huge opportunity to displace KDB+ as the mainstay for timeseries in capital markets. It is a premium product, but it obviously comes with a cost - both for the product and its operators.

Assuming all use cases can be catered for, if one needs an extra N machines if using TimescaleDB to cater for the same workloads, it might nullify any savings. If only there were a way to understand that without breaking their EULA...

Does this also apply to a locally installed version? It didn't in the past.

"On Demand" is IIRC the cloud kdb+, which is thus much less predictable and easy to misrepresent.

Not my specialty, but website shows two versions available for download (at least for free non-commercial use): 64-bit "on-demand" or 32-bit. Both have a similar no benchmarking clause.


32-bit version: "(c) 32 Bit Kdb+ Software Evaluations. User shall not distribute or otherwise make available to any third party any report regarding the performance of the 32 Bit Kdb+ Software, 32 Bit Kdb+ Software benchmarks or any information from such a report unless User receives the express prior written consent of Kx to disseminate such report or information."

But more broadly, feedback from finance/capital markets suggests that the choice of kdb's proprietary Q query language vs. standard SQL is top of mind: SQL expands access to time-series data and insights from a small set of highly specialized engineers to any of their developers, analysts, and tools.

Thanks, interesting. I would be understanding if this were related to the "free" version and the "bought and paid for" version did not have such a restriction - but it might anyway.

I have used kdb in the past, and it is friendly in the Unix sense (picky about who it makes friends with) - first-time users often write queries that use slow scalar loops.

Regardless, thanks; I'll be looking closely at TimescaleDB.

How does it compare with using temporal tables in a relational database?

Temporal table = state of the data in the table at a particular time. Very useful for auditing or seeing how data has changed between different periods.

Timeseries = data with a primary key that includes time, potentially with other time properties. For example, metrics are commonly associated with a value at some time.

Well, as far as I understand, temporal tables usually have either a valid time or a system time. TimescaleDB is geared towards something you'd call measurement time, where most modifications are inserts at recent measurement times. In contrast, temporal tables are often still update-heavy, and there is often no correlation with recent time, especially for valid-time fields.

Pretty much anything that is naturally generating time series data, and when your queries are going to be about things-over-time in relationship to each other.

Think automotive monitoring (external or internal), algorithmic trading, retail monitoring, aviation, etc.

They're commonly used for collecting metrics, such as performance or traffic data.

the operational historian parts of SCADA systems

I'm currently using InfluxDB v1.x and I'm not very happy with it for many reasons (impossible to delete a value, no clustering in the free version, ...). Can anyone who migrated from InfluxDB to Timescale share their opinion?

I'll let the community talk more about their own experiences, but we built some easy tools to enable this migration:

Outflux (snapshot migration): https://www.outfluxdata.com/

Telegraf (streaming migration): https://blog.timescale.com/blog/introducing-the-postgresql-t...

Both of these tools will perform automatic schema generation in TimescaleDB, which greatly simplifies the migration.

There is no clustering in timescale either. One of the reasons I stopped exploring this option.

@dominotw: See the parent article =)

Is aggregation working? Last I checked, it could only aggregate old values into a different table, which in turn makes visualizing them painful.

If you need to scale bigger, check out Interana (I am an engineer at Interana). We've created a time series query engine & application. We have clusters with over a trillion events.
