
Building a distributed time-series database on PostgreSQL - craigkerstiens
https://blog.timescale.com/blog/building-a-distributed-time-series-database-on-postgresql/
======
PeterCorless
The biggest limit is that their "chunking" of data by time-slices may lead
directly to the hot partition problem -- in their case, a "hot chunk." Most
time-series data is 'dull time' -- uninteresting time samples of normal stuff.

Then, out of nowhere, some 'interesting' stuff happens. It'll all be in that
one chunk, which will get hammered during reads.

Like, imagine all the telemetry data and video that was taken during a single
moon landing. Most of the data made into a time series is from the days in
transit. 99% of it will be "uninteresting." But the moment Neil Armstrong sets
foot on the lunar surface, and the moments leading up to and following that
event, will be the "hot chunk."

Advice: Take Zipfian distributions into account for data access.
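To make the advice concrete, here is a toy Python sketch (not from ScyllaDB or the article; the chunk names and distribution parameters are made up) showing how Zipf-distributed access concentrates reads on a single chunk:

```python
import random
from collections import Counter

def zipf_weights(n: int, s: float = 1.0) -> list[float]:
    """Zipf-like weights: the k-th most popular chunk gets weight 1/k^s."""
    return [1.0 / (k ** s) for k in range(1, n + 1)]

random.seed(42)
chunks = [f"chunk-{i:03d}" for i in range(100)]  # 100 time-slice chunks
weights = zipf_weights(len(chunks))

# Simulate 100k reads whose target chunk follows the Zipf distribution.
reads = random.choices(chunks, weights=weights, k=100_000)
counts = Counter(reads)

hottest, hits = counts.most_common(1)[0]
print(f"{hottest} alone serves {hits / len(reads):.0%} of all reads")
```

One chunk out of a hundred ends up serving roughly a fifth of all reads, which is exactly the access skew a uniform time-based chunking scheme doesn't account for.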

(Disclosure: I work at ScyllaDB, which scales horizontally and vertically, and
we work underneath various open-source time-series databases like KairosDB and
OpenNMS' Newts. Not trying to knock them, but hopefully to save them from a
world of hurt found out the hard way.)

~~~
enordstr
Blog post co-author and Timescale engineer here.

Thanks for the advice. FWIW, though, TimescaleDB supports multi-dimensional
partitioning, so a specific "hot" time interval is typically split across many
chunks, and thus many server instances. We are also working on native chunk
replication, which allows serving copies of the same chunk from different
server instances.
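The effect of multi-dimensional partitioning can be sketched in a few lines of Python (a toy model, not TimescaleDB's actual chunk-routing code; the bucket sizes and hash choice are hypothetical): a burst of writes in one hot time interval still fans out across the space dimension.

```python
import hashlib

NUM_SPACE_PARTITIONS = 4        # hypothetical "space" dimension size
CHUNK_INTERVAL = 3600           # hypothetical one-hour time chunks (seconds)

def chunk_for(device_id: str, ts: int) -> tuple[int, int]:
    """Route a row to a chunk keyed by (time bucket, hash of device)."""
    time_bucket = ts // CHUNK_INTERVAL
    digest = hashlib.sha1(device_id.encode()).hexdigest()
    space_bucket = int(digest, 16) % NUM_SPACE_PARTITIONS
    return (time_bucket, space_bucket)

# A burst of writes inside the *same* hot hour, from many devices...
start = 1_000_000 // CHUNK_INTERVAL * CHUNK_INTERVAL  # align to a bucket
hot_chunks = {chunk_for(f"device-{i}", start + i % CHUNK_INTERVAL)
              for i in range(1000)}

# ...fans out across all space partitions instead of one hot chunk.
print(sorted(hot_chunks))
```

Because the chunk key includes a hashed space dimension, the single hot hour lands in several chunks that can live on different servers.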

Apart from these mitigations of the hot partition problem, it's usually
better to serve the same data to many requests from a warm cache than to have
many random reads that thrash the cache.

~~~
benwilson-512
Hey Erik, thanks for the post. In this vision, would this cluster of servers
be reserved exclusively for timeseries data, or do you imagine it containing
other ordinary tables as well?

We're using postgres presently for some IoT, B2B applications, and the
timeseries tables are a half dozen orders of magnitude larger than the other
tables in our application. Certain database operations, like updates, take a
very long time because of this. I've wondered if by splitting the timeseries
tables onto their own server I could handle updates independently, with the
main app gracefully handling the timeseries DB being offline for some period
of time.

It's about more than just downtime, though. If the time-series DB is
overloaded through poor querying or other issues, the customer impact of the
slowdown would be limited.

~~~
mfreed
Hi @benwilson-512:

We commonly see hypertables (time-series tables) deployed alongside relational
tables, often because there exists a relation between them: the relational
metadata provides information about the user, sensor, server, or security
instrument that is referenced by id/name in the hypertable.

So joins between these time-series and relational tables are common, and
together they serve the applications one often builds on top of this data.

Now, TimescaleDB can be installed on a PG server that is also handling tables
that have nothing to do with its workload, in which case one does get
performance interference between the two workloads. We generally wouldn't
recommend this for production deployments, but the decision here is always a
tradeoff between resource isolation and cost.

------
theomega
I can only recommend TimescaleDB. It solves the right problems (storing
time series) while not creating new ones (deployment, backup, hot failover),
as it relies on Postgres to provide the underlying infrastructure. I stored
100 million sensor samples in Timescale and had no issues with scaling on
medium-sized boxes, despite issuing complex time-series queries.

As for hosting options, AWS sadly doesn't currently offer Timescale as part
of RDS. There are two alternatives: Azure now offers Timescale as part of its
hosted Postgres, or you can go with aiven.io, who can host Postgres with
TimescaleDB on all cloud providers (AWS, GCP, Azure, DO, ?) as a service,
including replicas and backups.

Overall, I’m very happy to see the Postgres ecosystem growing.

~~~
brootstrap
Interesting. My team currently uses (abuses?) postgres for time-series data.
Mind answering some general questions about your experience with Timescale?
You said 100 million sensor samples. What was the upload/download frequency?
Our application is pushing hundreds of millions of rows across many different
data sources every day. On top of that, we are also querying the shit out of
this data to run models, and we need VERY quick queries, like 10-100ms.

How do you think timescaleDB would handle that size and also velocity of data?

~~~
valyala
TimescaleDB could fit your workload if PostgreSQL fits you. The main issue
with PostgreSQL and TimescaleDB is the large amount of storage space required
for huge time-series data volumes. There are reports that storing data on ZFS
can reduce the required storage space.

Probably, ClickHouse [1] would fit your needs better. It can write millions of
rows per second [2]. It can scan billions of rows per second on a single node,
and it scales to multiple nodes.

Also I'd recommend taking a look at other open-source TSDBs with cluster
support:

- M3DB [3]

- Cortex [4]

- VictoriaMetrics [5]

These TSDBs speak PromQL instead of SQL. PromQL is a query language specially
optimized for typical time-series queries [6].

[1] [https://clickhouse.yandex](https://clickhouse.yandex)

[2] [https://blog.cloudflare.com/http-analytics-for-6m-requests-p...](https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/)

[3] [https://www.m3db.io/](https://www.m3db.io/)

[4] [https://github.com/cortexproject/cortex](https://github.com/cortexproject/cortex)

[5] [https://github.com/VictoriaMetrics/VictoriaMetrics/](https://github.com/VictoriaMetrics/VictoriaMetrics/)

[6] [https://medium.com/@valyala/promql-tutorial-for-beginners-9a...](https://medium.com/@valyala/promql-tutorial-for-beginners-9ab455142085)

~~~
akulkarni
We are actively working on native compression in TimescaleDB, with really
promising early results. More to come.

------
dima_vm
We at VictoriaMetrics recognized the importance of splitting up storage and
query nodes as well. We went even further and separated insert nodes from
storage nodes, so the cluster version has 3 types of nodes:

    
    
- vminsert (stateless)
- vmselect (stateless)
- vmstorage (stateful)
    

However, we found that the PostgreSQL storage layer takes an incredibly huge
amount of space -- 28 bytes/metric versus 0.4 b/m with VictoriaMetrics (a 70x
difference!) for typical real-world data. That's why we didn't consider
PostgreSQL for our storage layer, which otherwise would have been awesome.

(see Disk Usage benchmark graph at [1])

That hurts not only storage but also performance, as the query bottleneck
becomes disk IO; check out this benchmark we conducted against TimescaleDB
v1.2.2: [2]

Good job on going multi-node in v2! Can't wait to benchmark it with VM cluster
version :)

[1] [https://medium.com/@valyala/measuring-vertical-scalability-f...](https://medium.com/@valyala/measuring-vertical-scalability-for-time-series-databases-in-google-cloud-92550d78d8ae)

[2] [https://medium.com/@valyala/high-cardinality-tsdb-benchmarks...](https://medium.com/@valyala/high-cardinality-tsdb-benchmarks-victoriametrics-vs-timescaledb-vs-influxdb-13e6ee64dd6b)

~~~
mfreed
Hi @dima_vm, we've found that users have really embraced the full SQL and
reliability you get from TimescaleDB's approach of leveraging PostgreSQL. But
we're aware that its standard on-disk format can be more space-intensive than
others' (although many do deploy with ZFS to trade some CPU for I/O).

Recognizing this, the engineering team has been hard at work bringing native
compression to TimescaleDB, which is also in private beta right now.

Huge wins, but more details & performance numbers in a future blog post =)

------
hbcondo714
Still waiting for AWS RDS for PostgreSQL to support the TimescaleDB
extension, 2 years and counting:

[https://github.com/timescale/timescaledb/issues/65](https://github.com/timescale/timescaledb/issues/65)

~~~
aeyes
Probably not gonna happen as AWS wants to sell its own solution.

~~~
hbcondo714
When though? AWS Timestream was discussed on HN 8 months ago[1]. I had
registered for the preview and still no access or even a response from AWS.

[1]
[https://news.ycombinator.com/item?id=18553336](https://news.ycombinator.com/item?id=18553336)

------
120bits
I just came here to tell you how happy I am with TimescaleDB.

I have almost 8 tables with over 60 million rows and I'm very happy with the
performance, considering I have a t2.medium instance (2 CPUs with 4GB RAM).
Like everyone else mentioned, having an AWS-hosted option would be awesome!

Is there a way to optimize storage? I have set the chunk size to a 1-day
interval, with about 2 million rows per day of writes.

My many thanks to the engineering team.

~~~
akulkarni
Re: optimizing storage, we are working on bringing native compression to
TimescaleDB. So far the results are really promising. If you're interested in
testing out an early version feel free to reach out - ajay (at) timescale.com.

------
ryanworl
Are you afraid the Postgres tuple-at-a-time iterator architecture is going to
be limiting for your long term performance optimizations?

Presumably the custom operators you’ve implemented in your query plan can push
projection and predicates down to a more efficient architecture without so
many indirect function calls / branches / etc, but once you get up past that,
aren’t you back in iterator land?

~~~
WireBaron
We actually haven't run into any limits here. One thing to keep in mind is
that postgres remote-fetch operations aren't tuple-at-a-time, so this
shouldn't be a bottleneck for our multi-node operations.

~~~
ryanworl
Have you done any analysis of your per-core scan rates for simple aggregations
like sum/count + group by with a reasonably large cardinality key? Or has
anyone published a benchmark you trust on queries of that variety?

An example would be TPC-H Q1, which is a little weak on the group by
cardinality, but is good for testing raw aggregation performance.

~~~
WireBaron
We actually have done fairly extensive benchmarking of high-cardinality data
on our single-node product (we have a blog entry detailing at least our insert
performance here: [https://blog.timescale.com/blog/what-is-high-cardinality-how...](https://blog.timescale.com/blog/what-is-high-cardinality-how-do-time-series-databases-influxdb-timescaledb-compare/))

We're currently focused on query optimization for our multi-node product, but
we don't have any numbers we're ready to share yet.

------
jnordwick
The time column in your hot chunk (e.g. the current day/hour for trading) is
going to get banged on by every query, and partitioning isn't going to help
much (it will probably hurt on writes) -- other ts databases will partition it
out after a period (e.g. an end-of-day roll). How do you deal with this?

also, while you can make data columnar, sql lacks a rich enough language to
take advantage of it. your advanced queries seem like they aren't very good at
exploiting the layout, and they need to be specially written into the db (you
cannot make your own high-performance queries easily). I've never seen a
decent LEAD/LAG query perform well, and they are too simplistic. I think you
are fighting a losing war if you are trying to optimize sql down to good
array-based access.

A good tsdb isn't just a changed storage layer. Performance is also heavily
influenced by how queries can be expressed.

~~~
mfreed
Hi @jnordwick: I talk about this in another response and in the parent
article.

Individual time intervals are also spread across the cluster. So if you are
collecting data from a lot of, say, sensors, servers, or financial
instruments, then reads/writes _for the same time interval_ are then
parallelized across servers.

[https://blog.timescale.com/blog/building-a-distributed-time-...](https://blog.timescale.com/blog/building-a-distributed-time-series-database-on-postgresql#architecture-access-nodes-and-data-nodes)

~~~
jnordwick
I saw. My understanding is that you basically have a sort on device, then
time: this helps some for queries on individual devices to some extent (but
probably hurts when you are inserting at 500 places instead of 1, or when you
have queries that span too many instruments).

My point (and the others', I think) was that you often have a very hot
segment -- yesterday's data is only used at night, for example -- and you can
have a hot device (e.g. the top 10 symbols). The partitioning doesn't help a
lot there unless you can spread the time around and rejoin (netezza used to do
something similar and it wasn't very good at it). Do you ever rebalance the
partitions? Getting your top 10 symbols accidentally stuck on the same
partition would be painful, especially without a way to control it.

splaying the records column-wise helps with this, but i'm not sure if you are
doing that.

~~~
mfreed
It's not hierarchical as you describe. You don't first partition on device,
then on time. They are done simultaneously -- see this older post for an
illustration/comparison: [https://blog.timescale.com/blog/time-series-data-postgresql-...](https://blog.timescale.com/blog/time-series-data-postgresql-10-vs-timescaledb-816ee808bac5/#postgresql-10-partitioning)

So this architecture fully allows various striping or distribution options
across time and space, even though the default might colocate chunks
belonging to the same device on the same machine (at least since the last
elasticity event).

------
manigandham
Timescale has improved greatly since first released and is pretty solid on a
single-node.

Wish they would tone down the hype in the blog posts, though: a
shard/chunk/partition are all the same thing. How you define the splits is
completely arbitrary, and every database uses its own algorithm, including
multiple levels.

~~~
mfreed
Hey @manigandham, thanks for the compliments on the database overall =)

I understand conceptually that this is all about splitting data, but I think
if you look at most scalable databases that use sharding, it's really meant
as a partitioning of the primary keyspace over servers, and then you just
globally map this sharding through client libraries, some transparent proxy,
or some map that every node maintains, because O(map) = O(# servers).
Examples: Cassandra, DynamoDB, scale-out memcached, Vitess, ZippyDB/RocksDB,
etc.

We instead track per-chunk state in catalogs, which gives us this level of
flexibility and allows the movement/migration of individual chunks on a much
finer-grained basis. This is both for placement/management _across_ the
cluster and for data management on single nodes, e.g. for data retention
policies, tiering, lazy indexing, etc.

I realize this isn't a hard-and-fast rule, and exceptions always exist. But
one reason we try to call this out is that we're often asked why we don't just
use a standard hash-based partitioning tool/system as a black box, which
wouldn't give us the fine-grained visibility & control that we find highly
useful for time-series data management.

[Timescale co-founder & post co-author]

~~~
manigandham
Hey Michael,

I get it, you guys are using the primary keys for data => chunk and a second
level for chunk => server/node. Other databases do this as well to abstract
physical and logical partition placement.

Anyways, nice to see the SQL interface and AN/DN role implementation. Easier
and more usable overall compared to some other solutions like Citus.

------
truth_seeker
Wow. For me personally, that sounds like distributed event-sourcing storage
at scale.

I don't know if anybody else noticed, but the article is so intuitive that it
covers almost everything. Often when such articles are published, I have to
google deeper to get a sense of their practical use.

I have one query: how efficient is the deletion (from disk) of chunks in the
new distributed model?

~~~
enordstr
It basically boils down to deleting a bunch of files on disk. The fact that it
is distributed doesn't affect efficiency too much; it is basically a delete
sent to all nodes, followed by a two-phase commit.

The upside of deleting entire tables (chunks) like this is that you don't pay
the same PostgreSQL vacuuming cost normally associated with row-by-row
deletes.
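That cost difference can be sketched in a toy Python model (hypothetical, not Timescale's implementation; the chunk interval and function names are made up): when rows are grouped by chunk, retention is a constant-time drop of whole chunks rather than a row-by-row delete.

```python
from collections import defaultdict

CHUNK_INTERVAL = 86_400  # hypothetical one-day chunks, in seconds

# Rows grouped by chunk key, like chunk tables backed by separate files.
chunks = defaultdict(list)

def insert(ts: int, value: float) -> None:
    chunks[ts // CHUNK_INTERVAL].append((ts, value))

def drop_chunks_before(cutoff_ts: int) -> int:
    """Retention: drop whole chunks that end before the cutoff.
    No row-by-row DELETE, so there is nothing left over to vacuum."""
    doomed = [k for k in chunks if (k + 1) * CHUNK_INTERVAL <= cutoff_ts]
    for k in doomed:
        del chunks[k]  # analogous to unlinking a chunk's files on disk
    return len(doomed)

for day in range(10):                 # ten days of samples
    for i in range(100):
        insert(day * CHUNK_INTERVAL + i, float(i))

dropped = drop_chunks_before(7 * CHUNK_INTERVAL)  # keep the last 3 days
print(dropped, sorted(chunks))
```

The drop never inspects individual rows, which mirrors why deleting whole chunk tables avoids PostgreSQL's vacuuming cost.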

~~~
truth_seeker
Thanks.

------
ttsda
Really nice that this is out. I've been following the GitHub issues related
to this for a long time, and I think it might be time to start that meshed
sensor network I was thinking of building with Timescale.

~~~
akulkarni
Glad to hear that :) Please let us know how we can help. We also have an
active Slack community [1] if you want to chat with others who are storing
sensor data in TimescaleDB.

[1] [https://slack.timescale.com](https://slack.timescale.com)

------
notinventedhear
TimescaleDB looks really very promising but this is a red flag:

"Hypertables support all standard PostgreSQL constraint types, with the
exception of foreign key constraints on other tables that reference values in
a hypertable"[1]

Naively, I'd assume this could cause a two-colouring of your schema -- the
partition that can use referential integrity, and another, with hypertables,
that doesn't -- which feels like a pretty big trade-off.

[1] [https://docs.timescale.com/latest/using-timescaledb/schema-m...](https://docs.timescale.com/latest/using-timescaledb/schema-management#constraints)

~~~
cevian
In practice this doesn't come up a lot. Say you have a hypertable with
measurement(time, device_id, value) and a device table with (device_id,
device_manufacturer, device_type). Timescale fully supports a foreign key
from the measurement table into the devices table. This is common usage. A FK
from another table that references a measurement row is not supported, but is
also uncommon. To see why, note that part of the primary key of the
measurement table is time, so conceptually the only type of table that would
want a FK into it is also a time-based table, and then the only real usage is
a 1-to-1 relation. That is also uncommon and can be worked around with
normalization.
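The supported direction can be modeled with a tiny Python sketch (a toy illustration only; the table contents and function names are hypothetical, and a real FK is enforced by the database, not application code):

```python
# Relational metadata table: device_id -> attributes.
devices = {
    "dev-1": {"manufacturer": "acme", "type": "thermometer"},
    "dev-2": {"manufacturer": "acme", "type": "hygrometer"},
}

# The hypertable: rows of (time, device_id, value).
measurements = []

def insert_measurement(ts: int, device_id: str, value: float) -> None:
    """The supported direction: hypertable rows reference metadata."""
    if device_id not in devices:  # what the FK constraint enforces
        raise KeyError(f"unknown device {device_id!r}")
    measurements.append((ts, device_id, value))

insert_measurement(1, "dev-1", 21.5)     # ok: dev-1 exists
try:
    insert_measurement(2, "dev-9", 3.0)  # rejected, like a FK violation
except KeyError as err:
    print("rejected:", err)
```

The unsupported direction would be the reverse: some other table holding a pointer to a specific (time, device_id) measurement row, which is the rare 1-to-1 case described above.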

------
einpoklum
> Building a distributed time-series database on PostgreSQL

Next order of business: Making mud pies.

PostgreSQL is geared towards transactional work. With time series, you
basically just append data occasionally and do analytics. PostgreSQL is
terrible for analytics -- its architecture is all wrong: 2 or 3 orders of
magnitude slower than the state of the art, if not more.

~~~
pvorb
Which databases are good for analytics from your point of view?

In my experience, being able to do advanced ad-hoc SQL queries is priceless
for analytics. Timescale helps in scaling time series use-cases that used to
scale badly in plain PostgreSQL.

~~~
einpoklum
It's not just my point of view - it's well known in the research community,
and has been for decades.

For FOSS, have a look at MonetDB. For research-oriented systems, look for
publications on HyPer or VectorWise/Actian Vector (VectorH in the cluster
version). Other commercial offerings are Vertica (formerly C-Store) and SAP
HANA.

PostgreSQL is not even something anyone compares against in analytics...

~~~
einpoklum
Oh yea, MemSQL and ClickHouse are indeed also relevant and in this category,
except that ClickHouse doesn't support all of SQL or arbitrary table
structures, so it's not a full-fledged DBMS.

------
dfischer
Anyone have examples of using this for financial / trading algorithm based
needs? I've been investigating solutions for a while now and haven't had much
luck on a winner.

Is it better to do aggregations with the DB or through some MapReduce method
(Google Dataflow?) and write that to a DB?

~~~
akulkarni
I believe we have some in our Slack channel [0].

Also one of our investors is Two Sigma so this is an area of interest to us.

If you're open to it, I'd love to learn more about your specific use case.
Want to chat sometime? ajay (at) timescale.com

[0] [https://slack.timescale.com](https://slack.timescale.com)

------
jadbox
Does anyone have experience how this compares with citusdb (also postgres)?

~~~
enordstr
We talk about sharding vs. chunking in the blog post, and I would put CitusDB
in the former category. More specifically, TimescaleDB focuses on time-series
workloads; to handle them, CitusDB suggests combining their extension with a
third-party extension (pg_partman) (see their docs).

I have no experience with this combination myself, so don't want to speculate
about performance, etc., but when reading the docs it really seems like an
afterthought.

------
simplify
What are some use cases for a time-series database?

~~~
cevian
[Timescaledb engineer here] We like to say that time-series data is any data
that is insert-mostly with data associated with the most recent time period.
That's a pretty broad definition, intentionally so. We see usage in telecoms,
heavy industry, science, health, IoT, etc. It's really about recording the
history of your data as it evolves, instead of just the current state.

~~~
_-david-_
How does it compare with using temporal tables in a relational database?

~~~
manigandham
Temporal table = state of the data in the table at a particular time. Very
useful for auditing or seeing how data has changed between different periods.

Timeseries = data with a primary key that includes time, potentially with
other time properties. For example, metrics are commonly associated with a
value at some time.
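The distinction can be sketched in a few lines of Python (a toy model only; the table contents and the `as_of` helper are made up for illustration):

```python
from bisect import bisect_right

# Time series: append-only samples whose primary key includes time.
cpu_series = [(1, 0.42), (2, 0.91), (3, 0.15)]  # (time, value)

# Temporal table: versioned rows, answering "what did this row look
# like at time t?"  Entries are (valid_from, row_data).
user_versions = [
    (1, {"name": "alice", "plan": "free"}),
    (3, {"name": "alice", "plan": "pro"}),
]

def as_of(versions, t):
    """Return the row version in effect at time t, or None."""
    starts = [valid_from for valid_from, _ in versions]
    i = bisect_right(starts, t) - 1
    return versions[i][1] if i >= 0 else None

print(as_of(user_versions, 2))  # the state before the plan change
print(cpu_series[-1])           # the latest sample in the series
```

The time series keeps every sample as a fact; the temporal table keeps versions of one logical row so past states can be reconstructed.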

------
TedLePoireau
I'm currently using InfluxDB v1.x and I'm not very happy with it, for many
reasons (impossible to delete a value, no clustering in the free version,
...). Can anyone who migrated from InfluxDB to Timescale share their opinion?

~~~
dominotw
There is no clustering in timescale either. One of the reasons I stopped
exploring this option.

~~~
mfreed
@dominotw: See the parent article =)

------
gsich
Is aggregation working? Last I checked, it could only aggregate old values
into a different table, which in turn makes visualising them painful.

------
minitoar
If you need to scale bigger, check out Interana (I am an engineer at
Interana). We've created a time-series query engine & application, and we have
clusters with over a trillion events.

