
Top Ten Time Series DBs - bglusman
https://blog.outlyer.com/top10-open-source-time-series-databases
======
vthriller
> Could you do it all in MySQL or Postgres? Possibly, but you'd have to write
> a lot of code to add the functionality many of these databases already
> provide.

Or you can throw in something like [0], I guess. (This thing is still in my
todo list though, so I can't tell anything beside the fact that this thing
also exists.)

[0]
[https://github.com/timescale/timescaledb](https://github.com/timescale/timescaledb)

~~~
manigandham
Any distributed relational database, especially with compressed columnstores,
will be better than any existing timeseries specific database.

Timescale, Citus, PipelineDB = postgres based but no columnstores. MemSQL,
MariaDB, ClickHouse = with columnstores.

~~~
saosebastiao
Time series databases have many many uses, not all of which require or even
benefit from column stores. Some uses may even be hurt by column stores.

~~~
manigandham
Like what? All those use cases are easily covered and better done by scalable
relational systems. Time series is just a narrow subset application.

~~~
saosebastiao
Time series is not even close to a narrow subset. There are tons of relevant
use cases for time series databases that conflict with the cost/benefit
positioning of column storage. What is someone to do if they need moving
window queries (thereby requiring a ts db), but also have to deal with heavy
mvcc transactions, lots of updates, weakly ordered inserts? These aren’t
uncommon conditions to deal with, even in time series data. Examples:

* Event sourcing collects events but also often has a notion of a “current” record, meaning inserting a new event requires updating a previous event to invalidate it.

* Financial transactions may involve many append-only tables, but which are linked with data in heavily updated tables, often requiring transactional mvcc.

* Logging events asynchronously or across several nodes often produces events that are emitted out of order which can sometimes span several time partitions, making insertion costly for column stores.

~~~
manigandham
None of that is difficult nor is it related to time-series data specifically,
they are just modern data warehouse features.

SQL already has window/analytical functions. Relational databases with
columnstores also have their traditional rowstores which support all the OLTP
features you need, along with easy joins across both table types. Many also
now pair a rowstore or in-memory segment with each columnstore for background
merges to handle rapid ingest and easy updates.

If you want a polished system, use MemSQL or SQL Server Columnstore Indexes,
or more manual work with clickhouse and others. We have 28 billion rows of
classic "time series" monitoring data in memsql compressed to less than 50gb
and complex aggregations return in milliseconds.

~~~
Licenser
28 billion data points in 50 gigabytes are not impressive for time series use.
That's nearly 2 bytes per data point, many time series databases achieve 1 or
less byte per data point.

~~~
manigandham
It's not a point, it's an entire row of data with dozens of columns and JSON.

 _Even if_ the data took twice as much space, it's still worth it to have
everything in a single data warehouse with easy joins and the full
expressiveness of SQL.

------
jordan_
I would also suggest checking out Timescale
([http://www.timescale.com/](http://www.timescale.com/)) - It's a extension
for postgres and does a phenomenal job

~~~
hanley
Have you used it? I've been looking at this for a couple months and am waiting
for Amazon RDS support. Curious to hear about people's experiences with this.

------
kalmar
Honest question: how do people use influxdb for monitoring and alerting? Our
metrics feed into influx, and I cannot get answers to simple questions like
“what is the failure rate” because arithmetic across measurements isn't
possible [0]. I could shoehorn things into a schema to make it work, but in
the limit I end up with one mega measurement.

[0]:
[https://github.com/influxdata/influxdb/issues/3552](https://github.com/influxdata/influxdb/issues/3552)

~~~
aequitas
Maybe try to use Graphite as additional query API for influx. We switched to
Influx from Graphite but some queries where unable or cumbersome to translate
into Influx query language (especially inside Grafana). In our case we use
Graphite-api with a Influx plugin instead of Graphite's own frontend.
[https://github.com/InfluxGraph/influxgraph](https://github.com/InfluxGraph/influxgraph)

~~~
_jomo
It's possible to write raw queries in Grafana by changing the query edit mode.

~~~
aequitas
True, but even then there are some queries you can't do which graphite can.
And you loose the nice editor grafana offers.

------
jmcgough
This seems more like an ad for DalmatinerDB. It's just weird to come up with a
ranking and then (of course) give your DB first place.

It does seem like an interesting tsdb though, love that it's built on riak.

~~~
Licenser
Outlyr has not build DalmatinerDB they've used it and contributed a bit to it.

------
sceadu
It may also be worthwhile to take a look at the CMU Time Series Database
lecture series:
[https://www.youtube.com/watch?v=2SUBRE6wGiA&list=PLSE8ODhjZX...](https://www.youtube.com/watch?v=2SUBRE6wGiA&list=PLSE8ODhjZXjY0GMWN4X8FIkYNfiu8_Wl9)

------
mattb314
I'm a little confused about the columnar database comment:

> Performing queries across billions of metrics looking for labels that only
> match a few of them (a common scenario with time series data at scale) is
> really slow in Cassandra. This is because of the way it stores data in
> columns. This extends to any columnar database including Google's BigQuery
> which all have a natural disadvantage with time series data.

I've pretty much only heard "columnar database" used as opposed to row store
database, and it seems like storing time series data in columns makes much
more sense. Could someone clear up exactly how "labels" (which I probably
don't understand) are so much harder for column stores to deal with?

~~~
Licenser
Because labels or dimensions are not stored in as a value but as a row
identifier in most implementations. That results in having to scan the entire
row space and look at every row name and see if it matches the lookup.

Storing labels in a row based system (like SQL) allows querying by value, not
column name which takes advantage of all optimizations and indexes making it a
lot faster.

That said there is nothing forbidding someone to do both, DalmatinerDB, for
example, uses a column-based format for metric values but a row-based format
(PostgreSQL) for dimensions.

------
bglusman
I submitted this mostly hoping to see if there were already any comments on
it, but I guess maybe now there will be if anyone has thoughts! I'd never
heard of Dalmatiner before, and they admit bias because authors maintain, but
they seem somewhat balanced in that I notice their spreadsheet here[1]
acknowledges a fair number of advantages InfluxDB has over them.

[1][https://docs.google.com/spreadsheets/d/1sMQe9oOKhMhIVw9WmuCE...](https://docs.google.com/spreadsheets/d/1sMQe9oOKhMhIVw9WmuCEWdPtAoccJ4a-IuZv4fXDHxM/edit#gid=0)

~~~
Licenser
I helped to create that spreadsheet we tried to be as fair as possible and
whenever possible link reproducible, verifiable benchmarks (but then again all
benchmarks are lies ;).

------
gaius
A comparison of timeseries DBs without KDB? I call shenanigans. Also others
have mentioned Timescale.

This article is pure clickbait from someone who isn't a serious practitioner
in the field.

~~~
SifJar
> I set some rules to attempt to limit the scope, otherwise this blog post
> would never end.

> Only free and open source time series databases and their features have been
> compared. Therefore if someone asks “have you tried Kdb+ or Informix?” the
> answer will be no. They are probably awesome though.

would be nice to see how KDB compares though

~~~
gaius
32-bit KDB is free

~~~
pritambaral
Free as in beer though, not speech; which is what the author intended, I
believe.

------
Gepsens
Of course author would compare apples and oranges. Druid is a Time series
oriented bucketed OLAP database, not a two dimensional metrics db like
dalmatiner or influx.

While I like all of these databases they don't cover the same spaces.

This blog post is useless

------
mikepurvis
I'd be interested to see more commentary on the graphical front end side of
things. I loathe how slow and overcomplicated Kibana is, but it does provide a
very nice kickstart to the business of exploring data and pulling together
dashboards— if, of course, you're using Elastic.

My killer tool for this would be able to talk to PostgreSQL, have a charting
backend based on Canvas/WebGL (so able to handle thousands of points rather
than dozens), and be easily pluggable to add in new kinds of visualizations.

~~~
azylman
Have you seen Grafana? I've had very good experiences with it. I've never used
it with Postgres, but it supports it:
[http://docs.grafana.org/features/datasources/postgres/](http://docs.grafana.org/features/datasources/postgres/)

------
jwatte
We ran into performance problems with graphite that github.com/imvu-
open/istatd doesn't have. I'd much rather run the latter for production and
application monitoring!

It has, like, 10x the per machine performance of the others. (See also: COST)

------
frik
Please add also relational databases incl MySQL and Postgres (both work great
for time series) and Cassandra itself. While also kairosdb, heroic, blueblood,
hawkula are based on Cassandra, it can be used for time series as well.

------
leetbulb
Mentioned on this list, Druid is #1 is my book. Imply[0] has a very nice
system built around Druid.

[0] [https://imply.io/](https://imply.io/)

------
Redoubts
Are any of these embeddable like SQLite?

------
ekvintroj
I don't see GemStone there...

------
deepsun
No ClickHouse, no PipelineDB?

------
adamnemecek
“Top 10 anime time series dbs”

~~~
zaptheimpaler
omae wa mou shindeiru

