
Building a scalable time-series database on PostgreSQL - RobAtticus
https://blog.timescale.com/when-boring-is-awesome-building-a-scalable-time-series-database-on-postgresql-2900ea453ee2
======
endymi0n
While I appreciate PostgreSQL every day, am I the only one who thinks this is
a rather bad idea? PG's row-based engine is the antithesis of efficient
storage and retrieval for time series with similar patterns, yielding almost
no compression. Columnar storage should naturally be much better at that
(BigQuery, Redshift, Citus), culminating in purpose-built stores like Influx,
Prometheus or kdb+. Prometheus, for example, manages to compress an average
64-bit FP sample, including metadata, to just 1.3-3.3 bytes depending on the
engine. As most DB work is I/O bound, that usually translates into at least an
order of magnitude faster lookups.
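
For reference, those Prometheus numbers come from Gorilla-style encoding:
delta-of-delta timestamps plus XOR'd float values. A rough Python sketch of
the XOR idea only - illustrative, not Prometheus's actual encoder:

```python
import struct

def xor_stream(values):
    """Sketch of the Gorilla XOR intuition: XOR each 64-bit float with
    its predecessor; a slowly changing series yields mostly-zero words
    that a bit-level packer can then store in just a few bits each."""
    bits = [struct.unpack(">Q", struct.pack(">d", v))[0] for v in values]
    out = [bits[0]]  # the first value is stored verbatim
    for prev, cur in zip(bits, bits[1:]):
        out.append(prev ^ cur)
    return out

# Repeated samples XOR to exactly zero, and nearby values share long
# runs of zero bits -- which is what makes ~1-3 bytes/sample possible.
words = xor_stream([20.0, 20.0, 20.0, 20.5])
```

A row store instead keeps every such sample as a full tuple with per-row
header overhead, which is the gap being described.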

~~~
solidsnack9000
Not all time series data is metrics (which is what Influx & Prometheus are
adapted for). Any kind of audit log -- to support a customer service console,
for example -- is going to be a time series of fairly complex, multi-field
records. Orders (and order modifications), support requests, payouts and
transfers... Most of business activity falls under the aegis of time series,
nominally immutable, even when it's not numbers. Working with this kind of
data is definitely helped by a comprehensive SQL dialect.

To take a contrary position: whose infrastructure is so large that a
specialized SQL and storage engine (like Influx) is necessary? Actually not
that many companies...so why does infrastructure always end up with these
special snowflake versions of things?

~~~
otterley
Companies don't dream of staying small, and they want solutions that scale
with them. If they do end up growing, planning ahead with a purpose-built
solution saves the time and resources spent paying down technical debt.

~~~
flukus
And many companies stay small or disappear because they wasted resources on
problems they didn't have.

~~~
otterley
It's a risk either way.

~~~
flukus
But an equal risk? For most companies it isn't. There are more failed big
dreamers than companies that missed the big time because they couldn't scale.

~~~
otterley
It's a good question, and I don't know where the quantitative data exists to
answer it.

My personal preference is to hire experienced teammates who have solved
similar problems in the past, because at most levels of scale (other than,
say, Google's or Facebook's), few problems are truly novel. So it's reasonably
possible to predict where the business will be WRT technological requirements
in, say, 1 to 3 years, sometimes 5 if you're just getting started, and plan
accordingly.

In the case of this particular problem (time series data), there are some
pretty easy and scalable answers out there -- Datadog, SignalFx, Wavefront,
Prometheus, etc. to start with. So it's not like you have to waste a bunch of
opportunity cost going out and building a solution.

------
LogicX
Congrats on the release!

I'm in the process of migrating from influxDB to a combination of Kafka and
TimeScaleDB.

Has anyone beaten me to the punch with testing whether it works with
[PipelineDB](https://www.pipelinedb.com/)? I can see that being a powerful
combination, perhaps also tied to their forthcoming
[Stride](https://www.stride.io/) analytics API.

~~~
zabomber
Out of interest, why are you moving away from InfluxDB? We're looking at
moving to it....

~~~
LogicX
We can't pull the stats we need.

We tried: so many continuous query rollups... They eat into the ability to
write data. At only 1000 writes/sec (artificially tripled because we split
data into multiple measurements to get around the cardinality problem) it was
dropping writes, even on a dedicated 24-core, 500 GB RAM, SSD box.

It can't scale when it's also being queried against.

Basically: their claimed performance looks great, but real workloads don't
match up. They've pushed fixing the cardinality issue back two quarters.

No reply on my GitHub issue about broken top().

Compression is good... is about all I can say lol

------
RobAtticus
Didn't want to make the title too long, but I should note this is an open
source database. Here if anyone has any questions.

Also, github if you just want to see the code:
[https://github.com/timescale/timescaledb/](https://github.com/timescale/timescaledb/)

------
jnordwick
What do you plan on doing for benchmarking? I don't expect you to get
something like STAC done, but will you try to find general benchmarks that
others use?

I've been working with TS databases for a long time now, and it never fails
that every database vendor has benchmarks showing they are the best (to put
that more bluntly: when you come out with your own benchmark suite and you are
the fastest/smallest/bestest, I won't be surprised, or believe you).

I don't expect you to be the fastest when having a row-oriented architecture,
and it would be an unfair comparison against the non-free databases, but I
would like realistic numbers.

Actually, if you came out 2nd or 3rd on competitors' workloads, I would be far
more impressed.

(But really? A row-oriented TS database?)

~~~
RobAtticus
Influx has a framework[1] we are looking at utilizing, while adding other
queries that they do not support (e.g., more complex predicates, non-time-
based ordering, JOINs). We hope this will give a more accurate set of
benchmarks for time-series workloads.

Definitely agree that it's sometimes hard to take numbers from a company at
face value. As you said, we don't necessarily expect to show that we are the
best at all types of queries (and will include query patterns where we may not
be the best), but for those who need more SQL features and richer queries we
expect Timescale to perform well. Additionally, working with people's existing
stack and database is another key win.

[1] [https://github.com/influxdata/influxdb-
comparisons](https://github.com/influxdata/influxdb-comparisons)

~~~
jnordwick
That benchmark looks like it's made to hide inefficiencies, if I understand it
correctly. Like, why is JSON serialization time not included? If you made the
interface JSON over HTTP, you should have to live with the performance
implications.

I notice they only compare against free versions. I wonder if they have tried
against any of the commercial vendors? The commercial vendors right now tend
to lead the FOSS pack by a very large gap.

~~~
mfreed
We've been using this framework to avoid the appearance of crafting our own
that cherrypicks only our strong suits. Don't disagree with comments that you
should account for JSON serialization cost in your benchmarks, but that's a
criticism of the suite, not Postgres/Timescale/our numbers.

To be clear: our tests are all using the standard PostgreSQL interface, not
JSON-over-HTTP. And we don't exclude any SQL parsing time in the DB.

~~~
jnordwick
I tried to read it a little more, but it is incredibly confusing to me. Did
you have issues with that too? I have access to a couple of commercial
products, but just trying to find the basics needed to run it is difficult.
STAC was so much easier to understand:

Given dataset X, how fast does query Y return?

------
artellectual
Wow, lots of very critical comments. I for one think this is a very good idea.
I have a use case right now that pretty much fits the Timescale manifesto.
Having the power of SQL is very attractive. I'm looking to move my current
setup to Timescale. Will let you guys know how it goes.

~~~
stalller
Hey, Timescale team member here. Thanks for the <3. Looking forward to
hearing how it goes for you.

Would also love to learn more about your use case... If you feel inclined to
share more, please shoot me an email andrew@timescale.com. Thanks again!

------
ckluis
from: [http://www.timescale.com/index.html](http://www.timescale.com/index.html)

Full SQL interface

Scale-out distributed architecture

Auto partitions across space and time

Single-node and cluster optimizations

Complex query predicates

Distributed aggregations

Native JOINs with relational data

Labels and metrics support

Geo-spatial query support

Fine-grained access control

Reliable (active replication, backups)

Automated data retention policies

------
jjirsa
At my last company we did tens of billions of sensor events per day into
Cassandra and I thank god our engineering team was smart enough to spend
manpower on product instead of writing yet another database

~~~
bigger_cheese
One thing I'm curious about is what are the differences between an operational
historian (like ABB, PI, Honeywell etc) and a time series database like
Cassandra or the one in this article?

I work a step upward in the stack (analyzing the sensor data and building
models based off of it - usually using SQL) so I'm not an expert on underlying
technology.

Over the last 15 years or so my work (an industrial plant) has been big on
historians. We've moved on from in-house bespoke stuff from the '90s, which as
I understood it was basically a custom-written ring buffer that held 1-2 days
of live data and would get down-sampled into Oracle tables as
1-min/6-min/hourly averaged data every hour or so.

Prior to this in the 80's we used a database technology called "IMS" but I'm
not as familiar with how this system worked.

Nowadays, in all our newer systems, all I hear from the low-level guys is
"OPC" (which I gather is a messaging format) and historians. Time-series
databases sound like they should fit in somewhere, but I'm not sure where -
given that other people use them for logging high-frequency sensor data, it
sounds like they could be a solution.

~~~
jjirsa
To be clear - Cassandra is great for time series but it's not a "time series
database" \- it just solves all of the same problems (and then some), because
it gives you natural replication, real HA, cross-wan, let's you cluster/order
data on disk using big table like sorting, and scales over a thousand nodes
per cluster.

For sensor data collection, it's very hard to beat. Lay something like spark
or filodb on top for more flexibility and it's virtually unmatched.

------
sciurus
If your main use case for time series is a graphite/statsd replacement,
another PostgreSQL-based project worth looking at is tgres.

[https://grisha.org/blog/2017/03/22/tgres-0-dot-10-dot-0b-time-series-with-go-and-postgresql/](https://grisha.org/blog/2017/03/22/tgres-0-dot-10-dot-0b-time-series-with-go-and-postgresql/)

[https://github.com/tgres/tgres](https://github.com/tgres/tgres)

~~~
gtrubetskoy
And Tgres requires no extensions to Postgres, you can use Postgres as-is (I
know because I wrote it :) ).

I'd also add a link to the post explaining how the time series are actually
stored, which is what makes it capable of receiving 150K+ data points per
second on relatively small hardware:

[https://grisha.org/blog/2017/01/21/storing-time-seris-in-postgresql-optimize-for-write/](https://grisha.org/blog/2017/01/21/storing-time-seris-in-postgresql-optimize-for-write/)

Fundamentally, though, what Timescale does is different in that they store (as
far as I can tell) every incoming data point, which is more of a "big data"
problem. Tgres, by contrast, is all about what you can do with Postgres if you
make the series regular (i.e. evenly spaced) on the fly and store time series
as round-robin archives of fixed resolution and span. So I don't think
comparing Tgres and Timescale makes a lot of sense.

~~~
JimNasby
Not sure what you're trying to say about "requires no extensions"; Timescale
is an extension that doesn't modify Postgres itself, similar to Citus, and
very different from XL/XC, GreenPlum, or EnterpriseDB's proprietary offerings.

~~~
gtrubetskoy
I simply mean that no CREATE EXTENSION is necessary.

------
Dangeranger
This looks nice. I've had to roll my own PostgreSQL schema and custom
functions a few times for timeseries data and if this prevents the need for
that I'm impressed.

The author mentioned write performance, but didn't touch much on read
performance. Are there any benchmarks you can share in that direction?

Also the author didn't touch much on the space requirements or the node
splitting expectations.

Those two areas would be useful to explain in greater detail.

~~~
RobAtticus
Thanks for the comments. I (not the author, but a co-worker) mentioned in
another comment that we are working on more benchmarks that we hope to share
soon, particularly in regards to read performance. We've seen some impressive
speed-ups for aggregates grouped by time intervals, but we're still a bit
early in our tuning on the query optimizations to give out more numbers.
Again, hopefully in the next few weeks.

Space requirements can admittedly be a bit higher than some other time-series
databases or column stores. Since disk is cheaper than memory, and with easy
ways of doing retention, we think it's a worthwhile trade-off for people for
now. But compression is something we will look at down the road. For our later
benchmarks we will include memory/space comparisons to help people evaluate.

As for node splitting, our clustering solution (not yet released) is being
worked on quite a bit as well. When it's closer to being ready we'll include
our thoughts on how to best manage your partitions.

~~~
Dangeranger
Are you considering configurable automated downsampling triggers for inserts
over time buckets? Essentially these would be aggregates triggered to fire and
insert the resulting values into another table of lower granularity after the
time bucket has closed (1s, 10s, 1m, 5m, etc.)
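
The bucket-close rollup described here can be sketched with a SQLite stand-in
(table names are hypothetical; in Postgres this would be a trigger or a
scheduled job rather than a manual function call):

```python
import sqlite3

# In-memory stand-in for a raw metrics table plus a coarser rollup table.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE raw(ts INTEGER, device TEXT, value REAL);
    CREATE TABLE rollup_1m(bucket INTEGER, device TEXT,
                           avg_value REAL, n INTEGER);
""")
db.executemany("INSERT INTO raw VALUES (?,?,?)",
               [(5, "a", 1.0), (30, "a", 3.0), (65, "a", 10.0)])

def close_bucket(db, bucket_start, width=60):
    """Once a time bucket can no longer receive inserts, aggregate it
    into the lower-granularity table -- the 'trigger' described above."""
    db.execute("""
        INSERT INTO rollup_1m
        SELECT ?, device, AVG(value), COUNT(*)
        FROM raw
        WHERE ts >= ? AND ts < ?
        GROUP BY device
    """, (bucket_start, bucket_start, bucket_start + width))

close_bucket(db, 0)  # the 0-60s bucket is closed; the 65s row is untouched
row = db.execute("SELECT avg_value, n FROM rollup_1m").fetchone()
```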

~~~
cevian
This is a great usage idea and something that should be fairly easy once we
support close and update triggers (on our list of todos)

------
manigandham
Most distributed SQL databases fall into two use cases: large data warehouse
and scalable primary data store.

For DW needs (including time series), MemSQL is still the best option for
distributed SQL unless you absolutely need Postgres features. It's proprietary
with a MySQL flavor, but its integrated columnstore plus in-memory rowstore
built for performance, and its fantastic ops management software, make it a
great system. It also has interesting Kafka and S3 ingest pipelines.

For an operational db, Citus is a great scale-out option for postgres. You can
combine it with cstore to get columnar compression for DW use but at that
point it's better to just use a more focused product. There's also CockroachDB
which looks good (if it ever gets out of beta). I'd also recommend ScyllaDB +
JanusGraph once both mature to get a scalable multi-master graph database if
HA/reliability/replication are a big concern.

Timescale DB is interesting and it's always nice to see more options, but
right now there are better data warehouse options already that work just fine
for time series + SQL. A built-in columnstore would go a long way to making it
competitive though.

------
Tommek
Whats the main difference compared to pg_partman? That provides parent and
child tables, automated partitioning and the user only queries the parent
tables.

[https://github.com/keithf4/pg_partman](https://github.com/keithf4/pg_partman)

~~~
RobAtticus
TimescaleDB does auto-sizing of the child tables (chunks) to keep write
performance relatively constant while also making retention policies easy to
implement. That is, we make sure the chunks are aligned to a time unit (e.g.,
an hour or a day), so when you want to remove data you're dropping whole
tables rather than individual rows.

Additionally, TimescaleDB comes with optimizations on the query side
specifically for time-series data.
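
The chunk-alignment idea can be sketched in a few lines (illustrative Python
with an assumed fixed interval; TimescaleDB actually sizes chunks adaptively):

```python
CHUNK_SECONDS = 3600  # assumed fixed chunk interval, for illustration only

def chunk_start(ts: int) -> int:
    """Align a timestamp to the start of its chunk, so each chunk
    covers exactly one whole time unit."""
    return ts - (ts % CHUNK_SECONDS)

def expired_chunks(chunk_starts, cutoff):
    """Retention: a chunk is droppable only when its *entire* interval
    is older than the cutoff -- then removal is a table drop, not a
    row-by-row DELETE."""
    return [c for c in chunk_starts if c + CHUNK_SECONDS <= cutoff]
```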

~~~
Tommek
OK, auto-sizing is missing in pg_partman. Retention works the same (just
dropping child tables). I've read the paper, and it made things clearer for
me.

So you distribute the child tables to several nodes of a server cluster.

Is network latency a problem? I guess one should colocate the servers in one
location rather than spread them out?

How good does it work when nodes die?

Do you use query parallelization (available since 9.6 in vanilla) on a single
node and across different nodes?

~~~
cevian
Yes, we distribute the child tables among the cluster. Our default
distribution mode uses "sticky" partitioning, where a partition prefers to
stay on the same node. This allows you to control data colocation via the
partition key. Our clustered solution is not released yet, but we plan to
handle node failures via regular Postgres STONITH mechanisms. Once a node
failure is detected, the system reconfigures which children to use.

Query parallelization works in single and multi-node cases.

~~~
mfreed
And a clarification just in case this wasn't clear: We use constraint
exclusion analysis to determine which of the "child tables" of the parent
hypertable to query. So, when performing query parallelization, you aren't
sending subqueries to chunks/nodes that will not match your SQL predicate
(plus a bunch of other query optimizations we do with pushdown, etc.).

This is also true at write time: you can insert a batch of rows in a single
INSERT, and the query planner will split it into sub-batches, which it then
inserts into the appropriate chunks. So you don't need to "pre-partition" your
data before writing to the system (even in the forthcoming distributed
version).
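
The batch-splitting step can be pictured like this (an illustrative Python
sketch of the idea, not TimescaleDB's planner; the names and the fixed
interval are assumptions):

```python
from collections import defaultdict

def route_batch(rows, chunk_interval=3600, partitions=4):
    """Split one multi-row insert into per-chunk sub-batches, keyed by
    the aligned time interval plus a hashed partition key -- the two
    dimensions a chunk is defined on."""
    batches = defaultdict(list)
    for ts, device, value in rows:
        key = (ts - ts % chunk_interval, hash(device) % partitions)
        batches[key].append((ts, device, value))
    return batches

# Three rows, same device: two land in the first hour's chunk, one in
# the next hour's chunk, so two sub-batches are produced.
b = route_batch([(10, "a", 1.0), (20, "a", 2.0), (3700, "a", 3.0)])
```

The same chunk metadata drives read-side constraint exclusion: a chunk whose
interval cannot intersect the query predicate is never visited.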

------
marknadal
Loved this article. Speaking as a competing database company: they did a
fantastic job relating to developers and being authentic! Great job, please
keep this up, it will definitely make you a winner.

Key quote of interest: "135K writes per second per node" - this is very good!
Quite impressed.

------
denfromufa
Has anyone used gnocchi outside of OpenStack?

It can use postgresql as an index store.

[https://docs.openstack.org/developer/gnocchi/install.html](https://docs.openstack.org/developer/gnocchi/install.html)

------
wsxiaoys
Congrats on the launch!

I was investigating the same topic (a PG-based time-series database) for a
stock tick data project, and would definitely give TimescaleDB a try.

Since financial data is mentioned in the blog, I'd be curious how it
performs/scales in practice.

~~~
RobAtticus
Thanks!

We're still preparing some benchmark numbers/performance numbers that we will
hopefully share in the coming weeks. We do have some write performance numbers
in there as you can see. There is also a lot of churn at the moment as we're
still in beta and refining some key features, so I don't want to speculate too
much on how performance looks until after we get a few more of our query
optimizations tuned.

------
koffiezet
So, having played/worked with a few time-series databases, this just doesn't
seem to fit the picture we're used to. For applications having to store a set
of well-defined time series in a more optimal way, it looks great. As a
generic time-series database, on the other hand, this sounds like a
maintenance nightmare.

I quite like PostgreSQL (and deploy it all the time), and I'm no fan of NoSQL
stuff, which just means you don't have to properly analyze your database
structure beforehand - but with time series it's a different matter. The data
you send to generic time-series databases tends to be very unpredictable. I
currently don't care what data is sent to Prometheus or Influx. This includes,
but is not limited to, ZFS stats of our storage, system load, network traffic,
VMware, nginx/haproxy, application usage and errors, ... I know that when I
need it, I'll have it available and can try to correlate the data at any point
in the future. In TimescaleDB it looks like I would have to pre-create tables
fitting all those stats, which would make it an absolute pain in the ass.

As counter-intuitive as it sounds, to do this properly/flexibly/usably in the
real world, you should impose a fixed/limited database structure. The tgres
project mentioned elsewhere in this thread seems a lot more useful for
'generic' cases. Maybe they can be combined in some way.

Also, another important thing in the time-series world is a REST API. Yes, you
might want advanced SQL to query data, but no, you don't want that to insert
it. Pretty much any application can do http(s) without any additional
libraries (except when you're talking about C/C++). Postgres on the other
hand? Oh no, I need a JDBC driver for this Java app, I need an extra Python
library for X, ... Not to mention you suddenly have to manage database pools
in the applications and have something that is a potential bottleneck.

Then, no Grafana support is a no-go. It is the de-facto standard visualization
tool these days.

What about retention policies, you don't need ms-accurate data-points from 2
years ago, when talking about many data-points, you might want to aggregate
this. Though since the data set in TimescaleDB will be limited due to specific
data-sets, this could be less of a problem.

~~~
RobAtticus
While you do need to create tables for your stats, you are able to ALTER TABLE
just as you can in normal Postgres without problem. It may not be as pain-free
as NoSQL in that regard, but building tooling for this or automating it is
certainly possible.

Alternatively, if you do want pure blob storage, Postgres's JSONB datatype
works as well, although with some performance trade-offs. This can work quite
well if you have some structured data that lives alongside other unstructured
data. [1][2]
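
That hybrid layout reads roughly like this (a SQLite/JSON-as-text stand-in for
Postgres's JSONB, with hypothetical column names):

```python
import json
import sqlite3

# Fixed columns for the fields you always have; a document column for
# whatever tags a collector happens to attach.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE metrics(ts INTEGER, name TEXT, tags TEXT)")
db.execute("INSERT INTO metrics VALUES (?,?,?)",
           (100, "zfs.arc_size", json.dumps({"pool": "tank", "host": "fs1"})))

ts, name, raw_tags = db.execute("SELECT * FROM metrics").fetchone()
tags = json.loads(raw_tags)  # the unstructured part, recovered on read
```

In Postgres the document column would be JSONB and queryable/indexable
directly, at the performance cost the links below discuss.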

Building or putting a REST front-end in front of PostgreSQL should not really
be an issue (we built one for our hosted version), and there are already a
fair number of PostgreSQL clients/libs. That said, for ease of use we are
already thinking of adding an HTTP interface.

For Grafana: we were actually working on our own connector to support
PostgreSQL, but found out that one is already in the works by the Grafana team
(and it will work out of the box with Timescale, because each of our nodes
looks like PostgreSQL to the outside world). This doesn't seem like it'll be
an issue for very long.

We've mentioned retention policies elsewhere, but you do bring up a good
point. Instead of dropping data, support for aggregating data in older chunks
to a more coarse-grain resolution is something we are already looking into.

[1] [https://www.citusdata.com/blog/2016/07/14/choosing-nosql-hstore-json-jsonb/](https://www.citusdata.com/blog/2016/07/14/choosing-nosql-hstore-json-jsonb/)
[2] [https://blog.heapanalytics.com/when-to-avoid-jsonb-in-a-postgresql-schema/](https://blog.heapanalytics.com/when-to-avoid-jsonb-in-a-postgresql-schema/)

~~~
koffiezet
I think you're missing my point a bit. My view is maybe limited to the 'ops'
side of things, but creating and maintaining tables completely defeats how I'm
currently using metrics, and I suspect this applies to most people using them,
judging by the available ops-targeted metrics collector tools: collectd,
Telegraf, Intel's Snap, ... Going over the list of their available
modules/plugins/sources should give you an idea of how realistic it is to
maintain tables for all those metrics if you were planning on adding support
for them.

The main reason time-series databases are 'hot' these days is because ops
jumped on them, and I think understanding how metrics are being used there and
for what purpose is the key here. My first reaction when something generates
'events' or data - whatever that might be - is simply to push them into a
time-series database without thinking about them. Is it data from snmp (a
switch, firewall, ...), generic machine stats, database stats, application
metrics, ping statistics, time until an ssl certificate expires, ... you name
it - I don't care, maybe it'll be useful, maybe not. I don't even dare to
estimate the amount of different metrics I'm currently tracking.

Real-world example: did I think I had to know what the sizes of the different
caches on my ZFS storage units were? Not at all, but Telegraf pushed them
anyway - so whatever. They actually ended up being very useful for tracking
slowdowns on our file shares caused by some rogue process scanning all the
files on them, completely trashing the caches in the process. I had the data
right there at my disposal, and I didn't really know the details of what
exactly I was tracking until I had to take a closer look.

These use cases are something TimescaleDB's approach on its own would be
completely unsuitable for. As you mentioned, JSONB has performance trade-offs,
and blindly using it for everything sounds like a recipe for disaster. It also
only addresses one problem; complexity of use is another, and everyone having
to define their own structure and insert queries is yet another: there is no
standard. While there are quite a few time-series databases out there, most
popular tools support the most important ones, but I don't see how you can add
support for TimescaleDB since the data structure is completely undefined.

That's why I mentioned tgres (which I had encountered before). Something like
that, backed by a Postgres database with TimescaleDB's extension, could be
very interesting: doing the 'generic' metrics through a simplified,
well-defined interface with some fixed table structures, while still allowing
far more powerful 'direct' metrics to address niche needs, sounds very
interesting to me.

Currently, however, I only see TimescaleDB being useful within one
well-defined application, where it can be very valuable. In the grand scheme
of 'ops' things, however, it feels too limited and inflexible, and not really
like a time-series database.

> For Grafana: We were actually working on our own connector to support
> PostgreSQL, but found out that one is already in the works by the Grafana
> team

Ah yeah, I forgot that the Postgres datasource type was added in Grafana, yes
that would do.

The other thing is time-series-specific functions. I'm not really familiar
with all the stuff Postgres offers - I know it's a lot - but I couldn't find
things like regression analysis/derivative functions. Or maybe my google-fu is
failing. Providing metrics-specific functions in the Postgres extension should
be possible, though.

~~~
JimNasby
Something that might be very interesting would be combining
[https://www.torodb.com/](https://www.torodb.com/) with Timescale. ToroDB is
database agnostic (though they prefer Postgres). If you extracted the
appropriate keys (including the timestamp) out I expect the combination would
be very powerful.

------
jsulinski
How does this compare to say, Aerospike, or Honeycomb.io?

~~~
cyen
Honeycomb is also unapologetically a SaaS. We believe that - unless your
_company's_ core competency is, in fact, managing databases and a garden of
myriad open-source monitoring tools - it makes sense for most people to
outsource their observability solutions.

(We also don't currently support joins, while TimescaleDB's joins sound pretty
dope :))

------
stalller
Hi from Timescale, here are 6 insights we took away from this awesome
discussion. Thank you!

[https://blog.timescale.com/thank-you-hn-20-000-views-1000-stars-and-6-insights-from-the-timescaledb-launch-d7e9c5cdb9d0](https://blog.timescale.com/thank-you-hn-20-000-views-1000-stars-and-6-insights-from-the-timescaledb-launch-d7e9c5cdb9d0)

------
joshpadnick
This looks promising! Somewhat unrelated question, but has anyone had success
using AWS DynamoDB as a time-series database? I'm generally curious about the
case for a "managed" TSDB versus setting up your own open-source one.

~~~
takeda
Regarding DynamoDB: it is a key/value store, so IMO not a very good fit. I
suppose you could make it work for specific cases, but you will only be
running into issues. For example (note that I'm more familiar with Riak, which
is based on the Dynamo design, so it should have similar limitations):

You could use a timestamp as the key, and put data as json, but now you need
to make sure that all data arrives at the same time so it can be inserted
once, otherwise you will be doing series of updates.

For querying you are very limited, because you mainly want to query by the
main key, which would be the time. You can only get one key at a time; you
can't request a range of data or data fitting specific criteria (unless you
use some extra indices, which come with their own issues). If you have holes
between the keys you will need to find some way to make them predictable,
because listing all keys is a very expensive operation that only makes sense
when testing.
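
One common workaround is to make the keys predictable by construction: align
every write to a fixed bucket so a "range query" degenerates into N point
lookups (the key scheme below is hypothetical):

```python
def key_range(start_ts, end_ts, bucket=60):
    """If every sample is written under a key aligned to a fixed
    bucket, fetching a time range from a pure key/value store is just
    enumerating the aligned keys -- no key listing or scanning needed."""
    first = start_ts - start_ts % bucket
    return [f"metric:{t}" for t in range(first, end_ts, bucket)]

keys = key_range(130, 300)  # three predictable point lookups
```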

Expiration of old keys... generally it is not pretty either. Riak has the
Bitcask backend, which can expire old data, which could help, although the
expiration is more of a "time for which the given key(s) are guaranteed not
to be purged". I'm not sure if DynamoDB has this as well; for example, Riak's
LevelDB backend doesn't allow expiration.

There's no dedicated time-series database[1] in AWS; the closest thing to it
would be using a relational one.

[1] BTW: don't use the initials "TSDB", because that's the name of another
time-series database that runs on top of Hadoop - it confused me a bit :)

------
dangoldin
Very cool. The partitioning reminds me of the way VoltDB does it. By splitting
data across a set of keys, it's able to ensure that queries can run
concurrently, each on its own node.

Are you guys familiar with it, and am I understanding this properly?

~~~
mfreed
I don't remember enough specifics of VoltDB's partitioning/parallelization
details (or its academic H-Store predecessor), but that wouldn't surprise me.

VoltDB's focus on all-in-memory processing and general OLTP workloads
(so-called "NewSQL") occupies a different point in the design/problem space,
though.

------
ciconia
How does that differ from PostgreSQL's stock table inheritance (partition data
on arbitrary rules, while letting you efficiently query the parent table using
constraints)? The TimescaleDB approach looks very similar.

~~~
cevian
We actually really like (and use) Postgres inheritance. Our extensions add
functionality specifically geared toward time-series workloads. Namely, we
add:

- Automatic table sizing. This includes dynamically creating new tables for
new data, closing old tables as soon as they get too big, etc. (more
challenging than meets the eye because of deadlocking issues, etc.).

- Related to the above, optimized inserts to the correct chunks. In Postgres
9.6, this is slow using rules/triggers. In the upcoming Postgres 10 the
situation is supposed to be better (we're waiting for 10 to be finalized to
test). But even in Postgres 10, auto-closing and table creation are still not
supported, and implementing them would probably require the slow path. In
contrast, our inserts are very fast.

- Hash-partitioning query pruning according to the partition key.

- (Time-based pruning is supported also, but that's standard Postgres.)

- Time-based query optimizations. For example, we enable the planner to group
by time intervals using indexes on time columns. Plain Postgres can't support
that because it does not know that truncating/rounding time is
order-preserving. Other optimizations are in active development.

- Our upcoming clustered version will include distributed query
optimizations. In addition, our scale-out design takes advantage of many
workload features unique to time series (see the white paper for more details).
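
The order-preservation point is worth unpacking: if rows come off a time index
already sorted, a monotonic bucket function keeps them sorted, so grouping by
time interval can stream group-by-group instead of hashing every row
(illustrative Python, not the actual planner):

```python
from itertools import groupby

rows = [(t, t * 0.5) for t in (10, 70, 75, 130)]  # time-sorted input
bucket = lambda row: row[0] - row[0] % 60          # truncate to the minute

# Because truncation is order-preserving, equal buckets are contiguous
# in a time-ordered scan, so groupby() can aggregate them in one pass.
streamed = [(k, len(list(g))) for k, g in groupby(rows, key=bucket)]
```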

~~~
JimNasby
Out of curiosity, is it inheritance you like, or just the partitioning aspect
of it?

I've actually used inheritance far more than partitioning, but I know of
others that would love to see inheritance die.

------
pnathan
_Very_ interesting.

Can you speak as to the stability of your system for production usage?

~~~
RobAtticus
We are currently in beta, and so I don't think production use would be a great
idea at this point. We are looking to get things more stable relatively soon
so that more adventurous users may try us out in non-critical-path scenarios.

Right now we're looking for feedback as people use it on their own machines or
experiments. Don't want to mislead people into using this for something
critical and having it not work.

~~~
pnathan
Thanks for the clarification.

In the next few months, I'm going to be examining the feasibility of Postgres
as a timeseries store, with an eye towards deployment onto RDS for production.

Might have some feedback then.

Best regards- this project looks like The Right Way Forward.

~~~
RobAtticus
Thanks! Yes, we've had talks about how to get our extension onto RDS, so
hopefully by the time you're ready we'll have that done and ready for you.

------
didip
I couldn't find an architecture diagram showing how you scale this across N
nodes; can you elaborate on that? Is it similar to Postgres XL or Citus?

~~~
mfreed
At a high level, most distributed architectures take a similar approach:
spread partitions over multiple nodes, insert/query to any node, route
(sub)requests to the appropriate nodes. The unique part of Timescale is that,
unlike Postgres XL or Citus, our design/optimizations all focus on time-series
data.

So, as we've mentioned in some other replies, this means that our
partitions/chunks are typically based on two dimensions (time and a primary
partition key), chunks across nodes are automatically created and closed
(based both on time intervals and sizes), and there are a bunch of time-based
distributed query optimizations and time-series-specific functions that we'll
provide out of the box.

More information in the white paper:
[http://www.timescale.com/papers/timescaledb.pdf](http://www.timescale.com/papers/timescaledb.pdf)

------
ddlatham
_A single-node version is currently available for download. A clustered
version is in the works._

Adding clustering sounds like the hardest part.

~~~
RobAtticus
It's definitely something we want to make sure we get right before releasing.
We do think a single node can work for a lot of use cases, though, especially
with our scalable insert/write rates and our time-series-specific
optimizations.

------
nickpeterson
Are the hypertables effectively views with insert triggers on them?

Does Timescale have any baked-in support for bitemporal constructs?

Are temporal foreign keys supported?

~~~
akulkarni
It's a bit more complicated than that. At a high level: views + auto-sizing of
tables + a fast insert path + time-based query optimizations + (upcoming)
cluster support (+ more).

We do not currently support bitemporal constructs.

What's an example of a temporal foreign key query that you need?

~~~
nickpeterson
If you have two logical views of information, say people and their jobs.
Imagine the attributes of people change over time. Also the attributes of
their jobs change over time. It's possible to delete a person record from the
person table. What if a Job record references that person and it intersects
with the period of time that has been removed from the person. Suddenly the
data would be nonsensical. I don't ask this to be obtuse, but when I see joins
supported in sql on a time oriented database, I assume some position is being
taken on this.

Basically, does the database have an opinion on time oriented data
consistency?

~~~
cevian
Ah, sounds like you are again thinking about bitemporal reasoning over
relational data.

We are not really built for that use case. Instead of focusing on
fundamentally relational data such as people or jobs (which you are free to
store alongside our stuff, using plain postgres functionality), we store event
or metrics data. This data can be easily joined against the relational data
(e.g., using a person_id stored in both event and relational tables).
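
For example, such a join might look like the following (the schema here is
illustrative: a hypothetical "events" time-series table and a plain relational
"people" table, linked by person_id):

```sql
-- Join per-person event data against a plain relational table.
SELECT p.name, count(*) AS login_events
FROM events e
JOIN people p ON p.person_id = e.person_id
WHERE e.type = 'login'
  AND e.time > now() - interval '7 days'
GROUP BY p.name;
```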

~~~
nickpeterson
I reread 'Managing Time in Relational Databases' like 3 times one summer, and
now every time I mess with time in databases I have a heart attack.

That makes sense, thanks for answering ;)

------
mrkurt
Joins against time series data sound really nice.

~~~
RobAtticus
We like to think so! We provide some sample datasets for people to play
with [1][2] that have relational "metadata" tables alongside a time-series
table where you can do this.

One thing we'll note is that currently JOINs between two time-series tables
(what we call 'hypertables') are not optimized, but we're working on it! :)

[1] [http://docs.timescale.com/other-sample-datasets](http://docs.timescale.com/other-sample-datasets)

[2] [http://docs.timescale.com/getting-started/tutorial](http://docs.timescale.com/getting-started/tutorial)

------
jbverschoor
How does this compare to CitusDB?

~~~
akulkarni
Author here. We actually know the Citus guys pretty well, and spent some time
comparing notes last week at PGConf (Hi Ozgun!). We're quite friendly.

At a high level, Citus is solving a very different problem: general analytics
and transactions at scale, ideally for B2B applications. Timescale, on the
other hand, is laser-focused on time-series workloads.

------
andrestc
Would be interesting to get this integrated with Prometheus remote storage.

~~~
akulkarni
We plan to add support through remote_read/remote_write, but it is still on
our todo list. (And we're open to suggestions.)

------
skorgu
Congrats on the launch!

------
otterley
How many dimensions does each metric support?

~~~
mfreed
Not sure if this is what you're asking, but some monitoring-focused systems
have you associate each individual time-series metric with a bunch of
metadata/labels (either encoded into a single string
("dc1.rack15.server46.port2") or stored in some JSON/array).

We expose a more traditional Postgres schema, which is a different way of
thinking about the data model if you come from a monitoring world.

So you can shove data into a Postgres JSONB or array type if you want, but you
might more commonly have all your data items stored as individual columns. And
different/many metrics can be stored in a single row, so (1) you aren't paying
the label cost with each metric, and (2) our query planner makes it efficient
to support complex predicates on these metrics. For example:

    SELECT date_trunc('minute', time) AS minute, COUNT(device_id)
    FROM logs
    WHERE cpu_level > 0.9
      AND free_mem < 1024
      AND time > now() - interval '24 hours'
    GROUP BY minute
    ORDER BY COUNT(device_id) DESC
    LIMIT 25;

~~~
otterley
The real power of time-series databases comes from the ability to aggregate
metrics based on arbitrary dimensions. For example: create a dashboard
containing stack graphs of HTTP requests summed across all servers of a
particular application name, grouped by response code, one graph for every
application name, then separately sectioned in the dashboard by region - but
only for the servers in the production environment. And none of the dimension
keys or values are known to the database in advance.

Datadog, SignalFx, Prometheus, and Wavefront all have this ability. It's now
standard functionality for any reasonably advanced time series monitoring
implementation.

The fact that your implementation happens to be backed by Postgres is
interesting, but it's merely an implementation detail as far as power users
are concerned.

~~~
mfreed
You can query by arbitrary dimensions (combined with what we would think of as
time aggregates). In the future, you'll be able to materialize to arbitrary
dimensions with continuous queries.
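
In SQL terms, a query behind the kind of dashboard panel described in the
parent comment might look like this (the schema and column names are
illustrative; the point is that dimensions are just ordinary columns):

```sql
-- Requests summed across servers, grouped by app name, response code,
-- and region, restricted to the production environment.
SELECT date_trunc('minute', time) AS minute,
       app_name, response_code, region,
       sum(request_count) AS requests
FROM http_logs
WHERE environment = 'production'
  AND time > now() - interval '1 hour'
GROUP BY minute, app_name, response_code, region;
```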

~~~
otterley
How do you support arbitrary dimensions in a tabular database system without
massively blowing up the number of tables, the number of columns, or the
number of sparse rows?

This kind of problem is really an OLAP cube problem, yet you're throwing an
OLTP solution at it. I've seen it tried before at several sites and it's been
abandoned each time for having crippling performance and management behaviors.
What are you doing differently that will make this attempt successful?

------
tomc1985
Wow, cool!

How does this compare to Citus?

~~~
akulkarni
Hi, author here. I just answered this question in another thread:
[https://news.ycombinator.com/item?id=14037105](https://news.ycombinator.com/item?id=14037105)

But at a high level, Citus is solving a very different problem (general
analytics and transactions at scale, ideally for B2B applications).

Timescale, on the other hand, is super focused on time-series workloads.

------
williamle8300
I'm a novice regarding DBs. What's the significance of this?

~~~
akulkarni
Hi, I'm the author. There are a few things worth noting:

1\. Time-series data (data collected over time, e.g., DevOps data, financial
data, sensor data) is a growing (and lately, quite popular) type of "big data"
thanks to a number of trends: more sources of data, cheaper storage, rise of a
monitoring culture, IoT, etc.

2\. One of the characteristics of time-series data is that it grows very, very
quickly (e.g., when collecting data every few seconds), making it hard to use
traditional relational databases for storage. In response, people have
developed specialized time-series databases, which so far achieve scalability
by sacrificing query power.

3\. We just released a new time-series database that takes a different
approach, achieving both scalability and query power. One of the benefits of
our approach is that one can use normal SQL to run queries, something that was
missing in prior time-series databases. Another benefit is that we are tightly
integrated with PostgreSQL, one of the most popular and reliable stalwarts of
the database world. A lot of people find this approach useful.

If you collect any type of time-series data, then I welcome you to take a
closer look. Happy to answer any other questions.

------
sigi45
I really hate this style of writing. Why does it have to sound like every
other hipster IT text?

\- I'm not your padawan

\- PostgreSQL is cool, why do you say it is boring?!

\- Yes, awesome, some PhD people are vouching with their titles for a piece of
software I might want to use in a production environment? Nope. Thanks for the
effort, and I will watch it, but no.

I do like that you're doing this work, and thanks for it, but why not write a
post that goes into details instead of this hipster pseudo-tl;dr text?

~~~
mfreed
Sorry the tone wasn't to your liking. For a more academic voice (and deeper
technical dive), you might enjoy our LaTeX-typeset paper instead:

[http://www.timescale.com/papers/timescaledb.pdf](http://www.timescale.com/papers/timescaledb.pdf)

~~~
sigi45
Hey, thanks. I really like that way more.

