
First release candidate for TimescaleDB 1.0 - iamd3vil
https://blog.timescale.com/announcing-1-0-enterprise-ready-time-series-database-sql-open-source-400b4342dfa0
======
welder
Funny timing... just installed 1.0 rc1 this week. Currently dual-writing and
experimenting with reads from TimescaleDB in prod.

So far it's been perfect... besides a small config issue[1] causing a crash
that should have been detected when Postgres starts up, TimescaleDB gives me
the reliability and durability of Postgres without having to manually create
partition tables.

What will happen when Postgres supports auto creating partition tables for
declarative partitioning[2]? Will TimescaleDB still be necessary?

Feature request 1: when all rows of a chunk/partition table are deleted, can
that chunk be dropped automatically instead of having to manually run
drop_chunks[3]?
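For reference, the manual approach today looks something like this (the table
name and retention interval here are just illustrative):

    SELECT drop_chunks(interval '30 days', 'events');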

Feature request 2: for the getting started docs, add an option to install with
[https://nixos.org](https://nixos.org)

[1]:
[https://github.com/timescale/timescaledb/issues/733](https://github.com/timescale/timescaledb/issues/733)

[2]: [https://www.postgresql.org/docs/11/static/ddl-partitioning.html#DDL-PARTITIONING-DECLARATIVE](https://www.postgresql.org/docs/11/static/ddl-partitioning.html#DDL-PARTITIONING-DECLARATIVE)

[3]:
[https://docs.timescale.com/v1.0/api#drop_chunks](https://docs.timescale.com/v1.0/api#drop_chunks)

~~~
cevian
Thanks for trying out TimescaleDB. Glad you're having a good experience. :)

We think that declarative partitioning is a great step forward for Postgres
partitioning. That said, the process is still quite manual (including in PG11)
-- unless I am mistaken, auto-creating partitioned tables is not in PG11, but
may get released with PG12 at the end of 2019.

But even with auto-creating partitioned tables, TimescaleDB is still necessary
for a few reasons:

- Support for multiple time-based query optimizations in the planner

- More efficient handling of both queries and inserts when there are many
partitions

- Time-based analytical functions like time_bucket

- Support for changing partition sizes, etc.
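As a concrete illustration of the analytical functions, a typical time_bucket
query looks like this (table and column names here are just for illustration):

    SELECT time_bucket('5 minutes', time) AS five_min,
           avg(cpu_usage) AS avg_cpu
    FROM metrics
    WHERE time > now() - interval '1 day'
    GROUP BY five_min
    ORDER BY five_min;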

We are also working on many new features, optimizations, and capabilities
slated for the next year which will improve time-series workloads in
PostgreSQL even more.

------
atombender
Is TimescaleDB a good fit for events?

For example, we have data where each record is a tuple of (time, event,
object), where the event is things like "viewed", "performedSearch", etc., and
the object is event data as JSON. Let's say the object is a movie, in which
case the payload might be something like:

    
    
       {"id": 123,
       "name": "The Godfather",
       "director_id": 456,
       "genre": "crime"}
    

Our reporting UI lets users do aggregations based on arbitrary dimensions, so
we might do the equivalent of:

    
    
        select object->>'genre' as genre,
          object->>'director_id' as director_id,
          extract(month from time) as month,
          count(*) as count
        from events
        group by month, genre, director_id
    

Then we do things like pivot/nest values to display the groupings.

It's unclear to me whether TimescaleDB fits this use case, or whether we'd
have to change how things are indexed. Right now we just index the whole
object itself. Another scheme I've considered is indexing with the values as
keys; for example, the above event would become {"name:The Godfather": 1,
"director_id:456": 1, "genre:crime": 1}, essentially representing the counts.
A counting aggregation would then be rewritten as a sum(). But it's unclear to
me how you do intersections here without also creating all the permutations
(i.e. something like {"director_id:456/genre:crime": 1}) beforehand.

We're currently using Elasticsearch for this. Performance is okay, but we're
not entirely happy with it.

~~~
nh2
I, too, would be very interested to find a Postgres-based replacement for
Elasticsearch.

Specifically replacing ELK by Postgres-Kibana.

Requirements for that:

    
    
       * Events are (timestamp, {arbitrarily nested JSON object})
       * Filtering by timestamp must be fast
       * Full-text search on the object is required
       * Exact constraints on all object keys must be fast
       * It should be possible to define indices on the object's fields so that WHERE clauses are fast
       * Counting the number of results should be fast, or at least have fast, reasonably accurate estimates
       * Support typical Kibana searches and filters
    

I have tried to implement Kibana's access patterns on Postgres and got quite
far, but never got past the problem of
[https://wiki.postgresql.org/wiki/Slow_Counting](https://wiki.postgresql.org/wiki/Slow_Counting),
which essentially means Postgres must visit every matching row when you write
a WHERE clause, even when using indices, because it has to double-check
whether the returned rows weren't actually deleted.

~~~
cevian
This seems like it could fit well with TimescaleDB, but obviously it would
take testing. My only concern would be full-text search on JSON, which I think
is possible but have never done myself. I would start with a TimescaleDB
hypertable on the event table (time TIMESTAMPTZ, object JSONB) with the
following indexes (or some subset of them, depending on testing):

- BTREE(time DESC)

- BTREE(time DESC, object)

- GIN(object)

- some kind of full-text index
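A minimal sketch of that starting point (table name and index choices are
illustrative, not prescriptive):

    CREATE TABLE events (time TIMESTAMPTZ NOT NULL, object JSONB);
    SELECT create_hypertable('events', 'time');

    CREATE INDEX ON events (time DESC);
    CREATE INDEX ON events USING GIN (object);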

I don't know why the slow counting problem would be a problem with WHERE
clauses since indexes are highly optimized to work with MVCC (e.g. hint bits
etc). The wiki article itself says this isn't much of a problem when using
indexes. But maybe you can elaborate?

~~~
nh2
I found that the full-text search on JSON worked remarkably well with Postgres
10 -- surprisingly this was the least of all problems.

The issue with slow counting is this from the wiki page:

> PostgreSQL will take advantage of available indexes against the restricted
> field(s) to limit how many records must be counted, which can greatly
> accelerate such queries. PostgreSQL will still need to read the resulting
> rows to verify that they exist; other database systems may only need to
> reference the index in this situation.

Typical scenario:

Suppose you make a trivial query that matches a lot of rows (even when using
an index), and you want to count the number of results in order to tell the
user how much they'll probably have to scroll through (quite important when
digging through logs, to know whether you'll have to scroll through a doable 3
pages or an impossible 3000 pages).

    
    
        SELECT COUNT(*) FROM logs WHERE object->>'environment' = 'production'
    

Then the WHERE will match 100 million rows, and Postgres will scan them all to
verify existence (for the reason quoted above), no matter whether
`object->>'environment'` has an index on it or not.

This will take many minutes, even on SSDs, just for showing a COUNT.
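One common workaround (adapted from the Postgres wiki) is to settle for the
planner's row estimate instead of an exact count; the function below is a
sketch, not something Postgres ships:

    CREATE OR REPLACE FUNCTION count_estimate(query text) RETURNS bigint AS $$
    DECLARE
        plan jsonb;
    BEGIN
        -- Ask the planner for its estimate instead of executing the query.
        EXECUTE 'EXPLAIN (FORMAT JSON) ' || query INTO plan;
        RETURN (plan->0->'Plan'->>'Plan Rows')::bigint;
    END;
    $$ LANGUAGE plpgsql;

    -- Near-instant estimated "count":
    SELECT count_estimate($$SELECT * FROM logs WHERE object->>'environment' = 'production'$$);

Accuracy depends entirely on the planner's statistics, so it's best suited to
rough pagination hints like the "3 pages vs 3000 pages" case above.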

~~~
atombender
Yep, the reason ElasticSearch is fast here is that the underlying Lucene
indexes essentially form a column-oriented database. This is superb for low-
cardinality fields like "object->>environment"; if it has just a handful of
values, then only those values are stored, as a sorted list of postings.
Intersections with other field-based constraints are vector operations and can
be super fast.

I suspect that to make a fast-counting time series mechanism for Postgres,
you'd need to create a new index type that used a columnar approach (or even
used Lucene underneath). I don't know much about what optimization options are
available to Postgres extensions, but it doesn't sound like it would be
impossible.

------
RobAtticus
Just to give a little (estimated) timeline:

* Currently our first release candidate is available via GitHub and Docker

* We are aiming to release a second release candidate next week

* Sometime shortly after that (1-2 weeks) we'll release the final 1.0.0

Happy to answer any questions!

~~~
scrollaway
Congrats on the upcoming release!

... any news on RDS? :)

I did my part and registered my company's interest but I don't think there's
anything else I can do.

~~~
RobAtticus
Nothing to share at the moment, but we appreciate you pinging them!

------
z0mbie42
Hi, this is very interesting!

Do you test whether TimescaleDB affects the performance of the relational part
of Postgres? In other words, can I use TimescaleDB as my primary DB to store
both my users and my time-series data?

~~~
mfreed
We have not observed any effects on relational table performance. Basically,
TimescaleDB code is on the path for each parsed query, but if the query does
not specify a hypertable, we just let the query "fall through" to Postgres'
standard query planning/execution.

But you do point out a very common use case and strength of TimescaleDB: you
can easily JOIN your time-series data (hypertables) with existing relational
tables.
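For instance, a typical query of that shape (schema and column names here are
assumed for illustration) joins a metrics hypertable against a plain
relational table:

    SELECT u.name,
           time_bucket('1 day', m.time) AS day,
           avg(m.cpu) AS avg_cpu
    FROM measurements m
    JOIN users u ON u.id = m.user_id
    WHERE m.time > now() - interval '7 days'
    GROUP BY u.name, day;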

------
qaq
Nice product for small/mid scale workloads. It's no Vertica or ClickHouse but
if you do not need the scale should work well.

~~~
cevian
To put a number on this claim: TimescaleDB at this point can handle up to 100
TB of data.

~~~
comboy
Where does the limitation come from? Or is it just the amount of data that
it's been tested on?

~~~
mfreed
TimescaleDB clustering is currently limited to a single primary with multiple
read-only replicas (which provide both HA failover and scaling concurrent
reads).

So the limitation here is the storage capacity available for the nodes, and so
the above reflects the amount of data we've seen in use in various RAIDed or
network-attached storage environments (like the cloud).

One interesting thing here is that you can "elastically" add a new disk to an
existing hypertable, and new writes will be automatically load-balanced across
the new disk. (In Postgres speak, we support multiple tablespaces in a single
hypertable, and we allow you to dynamically attach a new tablespace to an
existing hypertable:
[https://docs.timescale.com/api#attach_tablespace](https://docs.timescale.com/api#attach_tablespace)
)
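In practice that looks something like this (the tablespace name, path, and
hypertable name are just illustrative):

    CREATE TABLESPACE disk2 LOCATION '/mnt/disk2/pgdata';
    SELECT attach_tablespace('disk2', 'events');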

------
sciurus
Congrats on the RC! TimescaleDB is a very neat idea, there's a lot to gain by
building on Postgres.

There's also a very serious limitation due to that: the requirement to
predefine schemas. My primary use case for a timeseries-focused db is storing
system and application metrics. Using a commercial (e.g. datadog, signalfx) or
open source (e.g. influxdb, prometheus) timeseries product I can submit
arbitrary data. If I had to perform a schema migration every time a developer
wanted to record a new metric, it would be extremely painful.

If this has changed since I last looked at TimescaleDB, please correct me!

~~~
mfreed
This is _not_ the case.

TimescaleDB has full support for storing JSON data (inherited from postgres),
including with indexes, so you do not need to fully pre-define your schema for
these types of applications.

In fact, we added support for TimescaleDB to be a read/write backend for
long-term Prometheus metrics. Prometheus writes to it via its remote storage
interface, and the data appears automatically in TimescaleDB. But then, unlike
Influx and native Prometheus, you get to JOIN it against additional metadata
for richer questions. For more information:

[https://blog.timescale.com/sql-nosql-data-storage-for-prometheus-devops-monitoring-postgresql-timescaledb-time-series-3cde27fd1e07](https://blog.timescale.com/sql-nosql-data-storage-for-prometheus-devops-monitoring-postgresql-timescaledb-time-series-3cde27fd1e07)

[https://github.com/timescale/pg_prometheus](https://github.com/timescale/pg_prometheus)

~~~
sciurus
Cool! It looks like the metrics view you build on top of the values and labels
table makes it _reasonably_ easy to query. I still worry about how you get
good autocomplete in Grafana, though.

Supporting receiving data from Prometheus is nice, but for people who aren't
already invested in that it would be helpful if you either

a) picked an agent (e.g. Telegraf or Collectd) and taught it how to submit to
TimescaleDB directly

b) picked a protocol already commonly used by agents (e.g. graphite plaintext
protocol) and taught TimescaleDB to receive it

------
WalterGR
I feel like I learn of a new DBMS a couple times a month. Can anyone comment
on how well these upstart DBMSes tend to be tested / _proven_?

~~~
twunde
For new databases that I'm interested in, I generally check if there's a
Jepsen analysis for it
([https://jepsen.io/analyses](https://jepsen.io/analyses)) or whether it's
being open-sourced by a company that's been using it in production for a while
like FoundationDB
([https://news.ycombinator.com/item?id=16877395](https://news.ycombinator.com/item?id=16877395)).
In either of these two cases I would say it's well-tested. Otherwise, it's
often hit or miss. Thankfully the number of new db products I've seen is fewer
than the halcyon days of the MongoDB wave (2011-2014) and tend to be a bit
more focused on specific problems that the db solves.

~~~
pritambaral
Jepsen only tests distributed operation. For non-distributed databases (like
PostgreSQL, which is — for the purposes of this discussion — what TimescaleDB
is), Jepsen does not apply.

------
temuze
Any updates on full, scale-out clustering?

And are there any plans to support columnar storage for very wide tables?

