
Time Series Databases to Watch - SCHKN
http://devconnected.com/4-best-time-series-databases-to-watch-in-2019/
======
theomega
I have used TimescaleDB for several purposes. As it is built on top of
Postgres, all the existing tools, libraries and processes work out of the box.
This is a huge advantage if you are operating Postgres anyway: your existing
backup tools will work, as does your user management.

Scaling out is a more complex story: read scale-out works seamlessly using read
replicas (another existing Postgres mechanism). Replicas can also help with
high availability. Write scale-out is difficult with TimescaleDB (as with
Postgres): sharding can be an option, or buying bigger machines.

~~~
NewsAware
We ingest a lot of time series (IoT) data and use Postgres for other data, so
Timescale works quite well for us. One thing Timescale treats as a second-class
citizen, though, is updates to existing data points, which are 1-2 orders of
magnitude slower than inserts. Granted, this is also the case for all other
TSDB solutions out there, which for obvious reasons are optimized for inserts
and reads aggregated along the time dimension. Still, it would be amazing if
they could add fast updates to the already existing differentiation, for
cases like ours where we don't store events relating to a singular point in
time but rather time-spans, so new incoming data points might be "merged" into
existing time-spans.
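To make that merge-on-ingest pattern concrete, here is a minimal Python sketch. The span layout, the `gap` threshold, and the function name are all made up for illustration; this is not how any particular TSDB implements it:

```python
# Merge incoming points into existing time-spans instead of inserting raw events.
# A span is (start, end, value); a new point extends the latest span when it has
# the same value and arrives within `gap` seconds, otherwise it opens a new span.

def merge_point(spans, ts, value, gap=60):
    """Fold a (timestamp, value) point into a list of (start, end, value) spans."""
    if spans:
        start, end, v = spans[-1]
        # Extend the last span if the point continues it within `gap` seconds.
        if v == value and 0 <= ts - end <= gap:
            spans[-1] = (start, ts, v)
            return spans
    spans.append((ts, ts, value))
    return spans

spans = []
for ts, value in [(0, "on"), (30, "on"), (65, "on"), (200, "off")]:
    merge_point(spans, ts, value)

print(spans)  # [(0, 65, 'on'), (200, 200, 'off')]
```

In a real TSDB this merge would be an update to an existing row, which is exactly the operation the comment above notes is much slower than an insert.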

~~~
eternalban
Same domain. I've come to the conclusion that these systems are dualistic in
nature and the architecture needs to reflect that. I think the more general
expression of this would be a time-series -> document-store pipeline. The
concern then, if the time-series processing step requires domain data, is
serving up fresh views of the evolving state in toto. That epsilon of temporal
lag of 'current state' in the first stage is a product issue. For some use
cases it can be a very tight constraint, but I think quite a lot of others
will be perfectly fine with snapshot isolation and a generous freshness
requirement.

------
cbcoutinho
I work at a power plant that uses Wonderware - a SCADA system that stores its
time-series data in its own proprietary database, accessed as a linked
server via SQL Server. That means no query I run can be optimized by the SQL
Server engine, which makes any kind of analysis very expensive.

From my understanding, something like an integrated extension to the server
(not a linked/foreign server) would be great because the query optimizer would
be able to plan its queries based on that information.

Has anyone been in a situation like this and taken steps to mitigate the
inefficiencies of a system like this?

~~~
Dangeranger
This is exactly what Timescale does with PostgreSQL. Timescale is an extension
to the database, and all the existing database query planning systems just
work.

I am not aware whether MS SQL Server has anything equivalent.

~~~
guscost
SQL Server has a column-oriented index, which makes certain aggregations in
wide tables much faster than anything Postgres can do currently:
[https://docs.microsoft.com/en-us/sql/relational-databases/indexes/get-started-with-columnstore-for-real-time-operational-analytics](https://docs.microsoft.com/en-us/sql/relational-databases/indexes/get-started-with-columnstore-for-real-time-operational-analytics)

I’m not aware of any out-of-the-box support for smart partitioning by time
(Timescale’s main feature). You could set that kind of thing up manually but
it would be a fair bit of work.

~~~
greggyb
You definitely don't want to use that columnstore for _ingesting_ time series
data. If you need real-time reporting, you are better served by a row-store.

Columnstore engines _tend_ to be optimized for read, rather than write. I know
for certain that Microsoft's columnstore technology is not a good fit for true
real time applications.

~~~
billgraz
You can build a columnstore index in two ways in SQL Server. The easiest is as
a non-clustered index on a regular rowstore table. Your inserts are going into
the rowstore directly. This physically stores the rows as a rowstore with a
columnstore index on top.

The second is as a clustered columnstore index. This stores the rows directly
as a columnstore. That's the way I'd store time series data. In this case, all
the inserts go into a "delta rowstore" and are moved by a background service
into the clustered columnstore structure. Your inserts proceed as fast as any
insert into a rowstore. There is overhead to move the data since it's doing a
fair bit of compression. But that's going to be true of anything that does
compression.
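The delta-rowstore-plus-compaction idea described above can be sketched in a few lines of Python. The class name, batch threshold, and "compression" (here just a row-to-column transpose) are purely illustrative and not SQL Server's actual mechanics:

```python
# Sketch of the delta-store pattern: writes land in a cheap append-only row
# buffer; a background step turns full batches into a columnar segment.
# Real engines compress each column at that point.

class DeltaColumnStore:
    def __init__(self, batch=4):
        self.delta = []        # fast append-only row buffer
        self.segments = []     # "compressed" columnar segments
        self.batch = batch

    def insert(self, row):
        self.delta.append(row)             # O(1), no compression on the write path
        if len(self.delta) >= self.batch:
            self._compact()

    def _compact(self):
        # Transpose rows into columns; stand-in for the background mover.
        self.segments.append(tuple(zip(*self.delta)))
        self.delta = []

store = DeltaColumnStore(batch=2)
for row in [(0, 1.0), (1, 1.5), (2, 2.0)]:
    store.insert(row)
print(store.segments)  # [((0, 1), (1.0, 1.5))]
print(store.delta)     # [(2, 2.0)] -- third row still waiting in the delta store
```

The point is that inserts stay as fast as a plain rowstore insert, and the columnar/compression cost is paid later, off the write path.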

If you want really fast ingest, you can use an InMemory OLTP (aka Hekaton)
structure and layer a column store index on top of that. And that InMemory
structure can be persisted to disk. That gives you much better performance
than a plain rowstore. You'd probably want to write something to move the data
to a longer term structure but that would be the fastest way to ingest it I
can think of. Using SQL Server :)

~~~
greggyb
That's a good overview of the storage technologies in SQL Server.

I agree that memory-optimized is ideal for a real-time service. The overhead
of column indexing has ruled it out whenever I've benchmarked it for real-time
use cases. Last time was on SQL 2016, so the latest optimizations may have
brought it to par.

------
11thEarlOfMar
We're planning to deploy ClickHouse from Yandex[0]. I'd like to hear from
anyone who already has it in production, and what your experience with it has been.

[0] [https://clickhouse.yandex/](https://clickhouse.yandex/)

~~~
shaklee3
Any reason you chose ClickHouse over Druid or Pinot?

~~~
bsg75
Simplicity.

In our case, the simple deployment and management of ClickHouse is a key
feature. It is masterless [1], with no namenodes or coordinators, so each
machine looks the same, and there is only one process to manage.

If you rely on its replication mechanism for sharding, ZooKeeper becomes
necessary, but writing directly to nodes in an orderly fashion is also an
option (as we are doing).

[1] This means nodes are not aware of what the others contain, so queries hit
all nodes, with some perhaps having no work to do. Depending on your workload
this may or may not be a concern.

~~~
tepidandroid
This is interesting. In effect, you're doing your sharding client-side?

What do you plan to do if you need to add new nodes and/or you need to
rebalance nodes due to concentrated data access patterns? How do you handle
cross-node queries like joins?

~~~
bsg75
> In effect, you're doing your sharding client-side?

Correct. Some ETL inserts into specific nodes when sharding is necessary; in
other cases, Kafka engine tables on a group of nodes subscribe to common
topics, and we simply let the whole cluster participate in queries. This works
just fine when table scans are acceptable.

Rebalancing is a missing option here, short of moving partitions manually. But
in my specific use cases, I have not yet needed to rebalance across nodes.

Note that using native ClickHouse replication is still an option if we need it.
One cost to it is the extra work needed in the database cluster, so addressing
it in an earlier layer works for us.

> How do you handle cross-node queries like joins?

If I understand your question: since we are using the Distributed table engine
across the cluster definition, a query on any node will receive data from the
others as part of a join-less SELECT, federated on the node with the client
connection.

We are not doing any database-side JOINs currently. Plans are to augment data
in ETL, or join data post-query (potentially Spark). Clickhouse dictionaries
handle simple cases.

~~~
tepidandroid
Thanks for the details, this is very helpful.

------
suls
Also worth mentioning kdb+, since the title doesn't seem to limit choices to
open-source TSDBs only.

~~~
new4thaccount
I've been told that the J programming language's Jd database is similar in
concept to Kx Systems' kdb+, although far cheaper.

------
dqpb
RedisTimeSeries also looks interesting:
[https://github.com/RedisLabsModules/RedisTimeSeries](https://github.com/RedisLabsModules/RedisTimeSeries)

------
heinrichhartman
> InfluxDB is a completely open-source time series database

Well, InfluxDB is not completely open source. They have a free tier, which is
open source, but it is not clustered, so your data will not be highly available
and you can't scale beyond a single node.

~~~
SCHKN
That's true. They base their business model on providing HA clusters hosted on
AWS.

Thanks for the clarification.

------
Gravityloss
Have used RRDtool in the distant past. It's a fascinating project from a time
when things were designed with hardware limitations in mind.
[https://oss.oetiker.ch/rrdtool/](https://oss.oetiker.ch/rrdtool/)

RRDtool stores its data values in a circular buffer (or several buffers, at
different resolutions), so performance is constant - it doesn't degrade over
time. Resource usage is constant too. The price you pay is that older data is
stored at coarser resolution. I assume it also fixed the data points to a
regular time step for performance reasons, which is fine in my opinion.
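The round-robin idea is simple enough to sketch in a few lines of Python. Class and parameter names are made up; this models the concept only, not RRDtool's on-disk format:

```python
# Sketch of RRDtool's round-robin archive: a fixed-size circular buffer, so
# storage and write cost stay constant and the oldest data is overwritten
# rather than accumulated.

class RoundRobinArchive:
    def __init__(self, slots):
        self.buf = [None] * slots  # fixed allocation, never grows
        self.idx = 0

    def update(self, value):
        self.buf[self.idx] = value
        self.idx = (self.idx + 1) % len(self.buf)  # wrap around

    def values(self):
        # Oldest-to-newest view of whatever is currently retained.
        return [v for v in self.buf[self.idx:] + self.buf[:self.idx] if v is not None]

rra = RoundRobinArchive(slots=3)
for v in [1, 2, 3, 4]:
    rra.update(v)
print(rra.values())  # [2, 3, 4] -- the oldest sample was overwritten
```

RRDtool keeps several of these at different resolutions, with consolidation functions averaging samples down into the coarser archives.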

I don't know if there are modern integrations or reporting tools built on the
database.

~~~
asen
Here is an example: [https://github.com/asen/smg](https://github.com/asen/smg)
(I am the author)

------
dominotw
There is a lecture series on the CMU website where six TSDB vendors talk about
their databases:

[https://db.cs.cmu.edu/seminar2017/](https://db.cs.cmu.edu/seminar2017/)

------
matt2000
Would anyone mind sharing what time series DB they use in production, and what
for? I'm assuming most are used for metrics in addition to a standard SQL
database, but interested to find out if that's accurate. Thanks!

~~~
thisone
Apache Druid.

The killer feature for us is that we allow essentially ad-hoc querying over
long time intervals but require the results to be returned quickly. And the
dataset, while not of Google proportions, isn't small.

~~~
dvlsg
Ad hoc querying as in users can choose their own time range / bucket size /
something else?

~~~
thisone
Intervals, filters, aggregates, and granularities.

------
tyingq
I know it's not specifically a TSDB, but I would have mentioned the ELK stack
since it is often used in contexts that cross over with these.

~~~
SCHKN
Definitely a good point. Stacks such as ELK are often used for overlapping
purposes, and the development of Grafana Loki might be a good example of it.

Thank you!

------
oever
What's a good time series database for simple sensor applications like logging
temperature and pressure with a raspberry pi?

~~~
TickleSteve
...a CSV file?

seriously... simple n easy.

~~~
oever
The raspberry pi is on spotty wifi and needs to sync the data to a data
collection server now and then, but also have fairly low (5 min) latency.

CSV and rsync could do it.
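For what it's worth, the CSV side of that really is a few lines. The path and the stubbed sensor values below are illustrative; rsync (or any sync tool) ships the file whenever the network is up:

```python
# Minimal CSV logger sketch for a Pi sensor: append one timestamped row per
# reading. The file is append-only, so a partial sync never corrupts history.
import csv
import time

def log_reading(path, temperature, pressure):
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([int(time.time()), temperature, pressure])

# Hypothetical readings; a real logger would poll the sensor here.
log_reading("/tmp/readings.csv", 21.4, 1013.2)
```

A cron entry running `rsync --append` every few minutes would meet the roughly five-minute latency mentioned above.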

------
Nihilartikel
Druid.io didn't make the list, but it has worked really well for me in the
past. A bit of a chore to get running, though. Great for real-time aggregation,
and fast even with custom JavaScript filter functions. It also supports
limited SQL, though mostly without joins.

------
thelastbender12
Not quite time-series related, but what are some good query processing
solutions that let you make uni-temporal queries without too much application
code? For example, assume a simple application that uses event-stream-based
storage.

* New users filling up their profile - {'id': ,'name': ,'city': , ...}

* Older users making updates to their profiles or even adding new fields - {'id': ,'city': }, {'id': , 'new_field': }

Which database solutions make it easy to write queries like "give me the
profile information for id X as of May 3rd, 2018"? A document DB like Mongo
definitely supports this, but joining data across different streams doesn't
seem as straightforward as joining SQL tables.
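In application code, an as-of lookup over a single stream is just a fold over the events at or before the requested timestamp. Field names and dates below are illustrative (the hard part the question asks about, joining across streams, is exactly what this sketch doesn't cover):

```python
# As-of (uni-temporal) profile lookup: replay the event stream for one id,
# applying only the patches whose timestamp is <= the requested date.
from datetime import date

events = [
    (date(2018, 1, 10), {"id": 1, "name": "Ada", "city": "London"}),
    (date(2018, 4, 20), {"id": 1, "city": "Paris"}),
    (date(2018, 6, 1),  {"id": 1, "new_field": "x"}),
]

def profile_as_of(events, user_id, as_of):
    state = {}
    for ts, patch in sorted(events, key=lambda e: e[0]):  # replay in time order
        if ts <= as_of and patch.get("id") == user_id:
            state.update(patch)  # later patches win per field
    return state

print(profile_as_of(events, 1, date(2018, 5, 3)))
# {'id': 1, 'name': 'Ada', 'city': 'Paris'}
```

Bitemporal stores like the ones mentioned downthread effectively index this replay so the as-of lookup doesn't scan the whole stream.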

Thanks!

------
bsdpqwz
Azure Data Explorer (Kusto) might be one to add to the "to watch" list:

[https://azure.microsoft.com/en-us/services/data-explorer/](https://azure.microsoft.com/en-us/services/data-explorer/)

We're migrating from a 50-node Elasticsearch cluster to ADX. IMHO:

* amazing query language (KQL)

* less work to maintain the cluster

* lower cost

(It appears to be similar to ClickHouse, but more feature-rich.)

------
bsg75
Is Prometheus in the same category as others discussed in the article and
here? IIRC in the open-source edition the scaling options are similar.

~~~
zaphar
Yes, and nowadays, because of K8s, it's gaining much more mindshare.

Prometheus with the Cortex backend gives you distributed Timeseries storage
with Prometheus.

------
jimmcslim
I have worked extensively with systems addressing the Australian National
Electricity Market, where data ticks either every five minutes (dispatch of
generation) or thirty minutes (settlement). I've often wondered whether a
proper TSDB would make my life easier when storing/querying meter and market
data, but I always ended up with SQL.

~~~
matt2000
My guess is that at 5min resolution (288 data points per day), you're probably
just better off going stock SQL. My understanding is that time series DBs
start getting useful at higher write loads that standard SQL DBs aren't
necessarily optimized for, and where you might want things like data to
coalesce into larger timeframes to save storage, etc. I'm not an expert
though, just what I've seen from my somewhat limited experience.

~~~
akulkarni
(TimescaleDB founder) One of the advantages of TimescaleDB is the additional
SQL functions that make time-series manipulation easier (e.g. interpolation,
LOCF, aggregating by arbitrary time intervals). So if you need any of that
functionality, I'd recommend taking a look, even if you don't have massive
scale.

------
jmakov
Interesting that nobody mentions GPU-powered DBs that can actually compete with
kdb+. There are also other "normal" DBs with impressive benchmarks, like
VictoriaMetrics and Dremio.

------
nzeeshan
InfluxDB is losing a lot of leads as their website is down!

~~~
secondtom
Cute dog though.

------
technimad
I’m really interested how these compare to the Splunk metric store. Which I’ve
had very positive experiences with. Anyone with experience in that area?

------
sashwatp
Is there an AWS alternative for timeseries data?

~~~
imglorp
I'm looking for this too; advice welcome. AWS-managed time series is not a
thing.

Timestream has been in preview for too long. Does that mean they're having
trouble productizing it?

You can host anything in the OP's article on EC2, but you don't get the managed
features. Plus, most of those (e.g. Timescale and InfluxDB) are seriously
priced for serious features like clustering/scaling/HA.

Plain Old Postgres is not awful for TS [1] but I'm also looking for something
better.

At $work, we're considering Plain Old Elastic, which is a managed service. It
scales well and if you don't have mountains of data, there shouldn't be
surprises. If you do have mountains of data, and you can tolerate older stuff
being slower in archive, you can throw it in S3 and query it with Athena,
while keeping the faster stuff in ES or Dynamo.

[https://grisha.org/blog/2015/09/23/storing-time-series-in-postgresql-efficiently/](https://grisha.org/blog/2015/09/23/storing-time-series-in-postgresql-efficiently/)

~~~
akulkarni
Our users have told us that AWS Timestream is quite expensive, about 10x more
expensive than TimescaleDB. It doesn't seem like it was designed with
operational workloads in mind. A more detailed comparison is in the works.
(TimescaleDB Co-founder)

~~~
imglorp
Good information, thanks!

I've been on the timestream preview list for months...

------
shaklee3
Any reason OLAP databases aren't in this category? Druid seems like one to
watch.

~~~
the-rc
Usually, time series databases use a variety of tricks to store data using as
few bits per sample as possible. An OLAP database sounds like a much more
generic system that will cover a larger number of use cases, but won't be
finely tuned resource-wise for this one aspect.

~~~
swaranga
Can I read somewhere about how the time-series nature of the data allows the
storage engines to optimize different parts of the system compared to generic
databases? Any links? Papers? Interested to know.

~~~
the-rc
See Facebook's paper on their Gorilla design:

[http://www.vldb.org/pvldb/vol8/p1816-teller.pdf](http://www.vldb.org/pvldb/vol8/p1816-teller.pdf)
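One of the Gorilla paper's central tricks is easy to see in a few lines: timestamps that arrive at a near-constant interval are stored as deltas of deltas, which are mostly zero and encode in very few bits. The sketch below models the idea only, not the paper's actual bit-level layout:

```python
# Delta-of-delta timestamp encoding, the core of Gorilla's timestamp
# compression: regular intervals collapse to runs of zeros.

def delta_of_deltas(timestamps):
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]

# A sensor reporting every 60s, with one sample arriving a second late:
ts = [0, 60, 120, 181, 241]
print(delta_of_deltas(ts))  # [0, 1, -1] -- tiny values, cheap to encode
```

Gorilla applies an analogous trick to the values themselves, XORing each float against its predecessor so that slowly changing series produce long runs of zero bits.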

------
DoctorOetker
time series database _software_

Time series databases would be actual datasets; I was looking forward to
reading about some very interesting time series datasets...

------
slifin
I'm watching Crux and Datomic

~~~
refset
As the product manager for Crux, I must point out that we are not currently
optimised for time-series queries, as there is no columnar compression in the
core bitemporal indexes. It is certainly something we have thought about,
however. For instance, we have already translated a couple of TimescaleDB
examples into our own test suite [0][1], and we have created a sample
aggregation "decorator" that sits on top of the Datalog queries [2].

Whilst there are no immediate plans from the core team to add columnar
compression or otherwise improve our support for time series use cases, I
think that a lot can be readily achieved by building on top of what already
exists in the core. I very much look forward to seeing others experiment with
the possibilities here.

[0]
[https://github.com/juxt/crux/blob/master/test/crux/ts_device...](https://github.com/juxt/crux/blob/master/test/crux/ts_devices_test.clj)

[1]
[https://github.com/juxt/crux/blob/master/test/crux/ts_weathe...](https://github.com/juxt/crux/blob/master/test/crux/ts_weather_test.clj)

[2]
[https://github.com/juxt/crux/blob/master/test/crux/decorator...](https://github.com/juxt/crux/blob/master/test/crux/decorators/aggregation_test.clj)

