
Comparison of opensource time series db's - dataloopio
https://blog.dataloop.io/top11-open-source-time-series-databases
======
avifreedman
Great comparison, and I hope the state of the world keeps getting better for
TSDBs so we don't need to build our own at some point - but I disagree re:

> Performing queries across billions of metrics looking for labels that only
> match a few of them (a common scenario with time series data at scale) is
> really slow in Cassandra. This is because of the way it stores data in
> columns. This extends to any columnar database including Google's BigQuery,
> which all have a natural disadvantage with time series data.

There's nothing inherent in columnar databases that makes it slow to match
only a few elements out of billions or trillions of records.

... but a classic columnar store might not be as efficient for storage, or
might take 5-10x the nodes to return at the same speed with that kind of
filtering, depending on the storage and clustering mechanisms used.

~~~
dataloopio
Hi, the wording could probably use some tidying up around that part and I'm
open to suggestions. However, I do think it's a big problem with columnar
based time series databases.

When somebody wants to query for a few points matching certain dimensions in
Cassandra, there's no getting around the fact that you have to do a scan
across potentially billions of data points.

Whereas if the index lives outside in something relational like Postgres the
lookup becomes insanely cheap and you're not having to scan over a bunch of
data.
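A minimal sketch of the external-index pattern described above, with SQLite standing in for Postgres and a plain dict standing in for the bulk point store (all names and data here are hypothetical):

```python
# Hypothetical sketch: series metadata (tags/dimensions) lives in a
# relational index, while the bulk point data lives in a separate store
# keyed by series id. SQLite stands in for Postgres.
import sqlite3

index = sqlite3.connect(":memory:")
index.execute("CREATE TABLE series (id INTEGER PRIMARY KEY, host TEXT, metric TEXT)")
index.execute("CREATE INDEX idx_tags ON series (host, metric)")
index.executemany(
    "INSERT INTO series (id, host, metric) VALUES (?, ?, ?)",
    [(1, "web-01", "cpu"), (2, "web-01", "mem"), (3, "db-01", "cpu")],
)

# The point store: series id -> list of (timestamp, value) points.
points = {
    1: [(0, 0.42), (5, 0.55)],
    2: [(0, 0.71)],
    3: [(0, 0.10), (5, 0.12)],
}

def query(host, metric):
    # A cheap indexed lookup resolves the handful of matching series ids...
    ids = [row[0] for row in index.execute(
        "SELECT id FROM series WHERE host = ? AND metric = ?", (host, metric))]
    # ...so only those series' points are ever read; no full scan of the
    # point store is needed.
    return {i: points[i] for i in ids}

print(query("web-01", "cpu"))  # only series 1 is touched
```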

There are quite a few databases that don't have an efficient external index.
For those, running 10 times the number of nodes would certainly speed things
up, but it's probably just a good idea to avoid databases like that if you
want fast queries.

~~~
avifreedman
Sorry to keep being pedantic, but I think this is important when thinking
about approaches to scalable and performant TSDBs, and I still disagree :)

Your example re: Cassandra is a problem with one particular columnar time
series database, not inherent to columnar-store backends for time series
data.

At Kentik, our in-house backend deals with data 80+ columns wide (what would
be tags in a TSDB), primarily network data, and querying across tens of
billions of records (tens of devices of data for 90 days) usually takes
0.5-2 seconds.

That's deployed on ~7 backend data nodes, running heavily multi-tenant with
300k-2m records/second ingested and averaging 450 queries/minute across a week
(don't have a peak query # handy).

But there's also nothing that says a columnar store database can't have
per-column indexes built in (vs. external).

------
user5994461
> I'm only interested in time series databases for use by developers and
> operations to store and retrieve data that pertains to the health and
> performance of the services they build and operate. Everything in this blog
> will judge the entries based on their suitability for that task.

That is a very particular problem, in which the data storage is a minimal [yet
important] aspect of the full system.

You're probably going the wrong route if you're trying to design your own,
and you'll only realize it way too late, once you have to build your own
metrics collection, own graphing, own alerting, own...

The standard proven open-source stack:

collectd/statsd (metrics collection) + whisper/graphite (storage) + grafana
(cute graphs and dashboards).

The latest fad is to replace Graphite with Prometheus (which is better in
some aspects but has its own faults).

Both these open source tools will satisfy your purpose.

HARDCORE LIMITATIONS: Both these open source tools are entirely single node.
There is no form of sharding nor high availability nor horizontal scaling.

(Rule of thumb: should be fine up to 100 hosts and applications. Then get
ready to throw big hardware at it and tune retention aggressively.)

---

Some quick maths:

8 bytes per point * one point every 5 seconds = ~967 kB per metric over a
week

~967 kB per metric * 100 metrics per host * 100 hosts = ~10 GB per week for
high precision

Any of these parameters can grow tenfold (depending on the setup, retention,
hosts, metrics per app...). That means going straight into the TB range and
scaling issues where one node is simply out of the question.
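The arithmetic above can be reproduced directly:

```python
# Back-of-the-envelope storage maths: 8-byte points, one point every
# 5 seconds, 100 metrics per host, 100 hosts, over one week.
bytes_per_point = 8
seconds_per_week = 7 * 24 * 3600            # 604800 seconds
points_per_week = seconds_per_week // 5     # one point every 5 s -> 120960

per_metric_kb = points_per_week * bytes_per_point / 1000
print(round(per_metric_kb, 1))              # ~967.7 kB per metric per week

total_gb = per_metric_kb * 100 * 100 / 1e6  # 100 metrics/host * 100 hosts
print(round(total_gb, 1))                   # ~9.7 GB per week, full resolution
```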

---

It's pretty clear that the open source solutions don't scale and are hard to
maintain... so what's next when we outgrow them?

Switch to the latest generation of monitoring tools. The two best solutions
are Datadog and SignalFx. They both accept custom metrics from your app.

And... oh wait I just noticed that dataloop.io is a new SaaS solution trying
to compete with them. Oops :D

~~~
dataloopio
The blog discusses storage sizes with very similar maths. DalmatinerDB uses
1 byte of storage per point, compared to Elasticsearch at the top end using
22 bytes per point.
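As a rough illustration of how much that spread matters (my own made-up workload, not a figure from the blog: one million metrics at 5-second resolution kept for 90 days):

```python
# Why bytes-per-point dominates storage cost: the same hypothetical
# workload stored at 1 B/point vs 22 B/point.
points = 1_000_000 * (90 * 24 * 3600 // 5)  # points written over 90 days

for name, bytes_per_point in [("1 byte/point", 1), ("22 bytes/point", 22)]:
    tb = points * bytes_per_point / 1e12
    print(f"{name}: {tb:.1f} TB")  # ~1.6 TB vs ~34.2 TB
```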

Build vs buy is an age-old discussion. You won't convince anyone to switch
from one side to the other. There will continue to be people like you and me
who would prefer to buy, and others who want to build and run it themselves.
As you have found out, I don't need to be convinced, as I started a company
to address the issue of there being no good options to buy at the time. In
most cases, for monitoring microservices, I'd buy a SaaS solution. I founded
Dataloop 3 years ago, so not really a new startup any more. We're past
Series A and starting to grow.

It's true that we compete with Datadog and SignalFx in that area, although
our real competition is open source, with 90% of the addressable market
using older tools like Nagios etc. As the shift to the cloud and
microservices happens I'm sure it won't be a winner-takes-all market.
Dataloop tends to focus on the enterprise end of the scale, whereas SignalFx
is more developer focused and Datadog is more operations and SME.

When you say "best", I'd argue that's subjective. SignalFx charges by the
metric and that gets very expensive. Datadog limits you to 100 metrics per
node with an agent-based pricing model. Dataloop uses per-node pricing
that's much cheaper, with unlimited metric volume. We're aiming to keep the
costs extremely low by investing in highly efficient backend storage.

The reason people are moving away from Graphite to InfluxDB and Prometheus is
the dimensional data model. Graphite simply isn't as powerful. Similarly,
StatsD aggregates down to the service and doesn't help pinpoint the outlier.
Prometheus collects all metrics in their raw format far more efficiently and
will let you instantly drill down into what is causing the issue.
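A toy sketch of that difference (not real Graphite or Prometheus APIs; all series names here are invented): dotted paths encode dimensions positionally, while a label-based model lets you filter on any dimension directly.

```python
# Graphite-style: dimensions are baked into a positional dotted path...
graphite_series = [
    "servers.web-01.nginx.requests",
    "servers.web-02.nginx.requests",
    "servers.db-01.postgres.queries",
]
# ...so selecting "all nginx request counters" means knowing the path layout.
nginx = [s for s in graphite_series if s.split(".")[2] == "nginx"]

# Dimensional style: every series carries explicit labels, so any dimension
# can be filtered without positional knowledge of the name.
labelled_series = [
    {"__name__": "http_requests_total", "host": "web-01", "app": "nginx"},
    {"__name__": "http_requests_total", "host": "web-02", "app": "nginx"},
    {"__name__": "db_queries_total", "host": "db-01", "app": "postgres"},
]
by_app = [s for s in labelled_series if s["app"] == "nginx"]

print(len(nginx), len(by_app))  # 2 2
```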

To answer your question about what's next after you outgrow open source
solutions that don't scale... well, that was kind of the point of the blog!
DalmatinerDB scales to millions of metrics per second on a single node and
linearly as you add additional nodes. It isn't exactly hard to maintain either
as it's based on Riak Core.

I guess the final thing to say is that this wasn't really an advert for
Dataloop. Our business model doesn't depend on selling database features.
Unlike other SaaS companies, we're happy to release the work done on our
time series database for free, available as open source.

Why would we do that? Mostly because it's fun to do open source stuff. Also
because hiring Erlang developers is pretty hard and this gives me an excuse to
talk at conferences where they hang out.

We've had a team of people working on this stuff for over a year now, and as
you've mentioned, no open source time series databases really scale. It's a
problem we've solved and are giving away for free. I must be really bad at
conveying that message in the blog.

~~~
user5994461
A full metrics/monitoring/alerting solution and a metrics storage engine
serve two different purposes. (The first one being solved by the latest SaaS
tools, including dataloop.io.)

I limited my previous message to the monitoring use case because it is
already quite long and a topic of its own, but I'd like to address storage
as well.

There are many reasons one might need a time series database for an
application, in which case you'd need exactly this kind of comparison.

---

There are a few things which I'd like to see about storage systems:

- What features does it have to compress and/or aggregate data, if any?

Some systems take 4 bytes per int, others take 50. Some store diffs, some do
not... That makes a huge difference.

- Can it cluster horizontally? Also, does it scale writes horizontally?

We can get 50-CPU systems with a 10 TB SSD array nowadays, but we probably
won't. It's actually rather challenging to scale vertically on AWS/GCE (not
so much on SoftLayer), not to mention the nightmare of having a single point
of failure for maintenance and issues.

I suppose we get that from the read/write numbers per 1-node and per 5-node
systems, which brings me to the next point.

- The performance numbers are somewhat misleading IMO.

You say yourself that you didn't do benchmarking. You're just taking some
random facts you found on the internet and presenting them as data.

- You should include the versions of the databases in the table. Features
change over time.

- Are you backing and contributing to DalmatinerDB? For some reason, the
link between Dataloop and DalmatinerDB wasn't clear to me on first read.
(Not to mention you're not even advertising your product or your company.)

- How much of DalmatinerDB's magic is based on ZFS? Does it actually need
ZFS to run?

As far as I remember, ZFS is still a BSD/Solaris-only citizen. (And don't
tell me it's coming in the next Ubuntu release; it's just a hypothetical
future until actually done ;) )

Anyway, it's a welcome comparison. Good work =)

---

> our real competition is open source with 90% of the addressable market using
> older tools like Nagios etc.

An interesting point of view. I personally consider 90% of the Nagios market
to not be a market at all. It belongs to people who only use it because it's
free (as in no money) and can be downloaded easily.

Free automatically brings in the students, the amateurs trying things in
their garage/homelab, micro deployments where it's enough, many companies
and people who simply don't value their time or the quality of what they
deliver, and finally everyone who has no money whatsoever or can't go
through the hassle of the purchasing department.

~~~
dataloopio
The raw data in the spreadsheet, which I have continued to update and is now
up to 15 databases with 30 characteristics each, is indeed applicable both
to people who want to pick a database for their monitoring stack and to
those who want a time series database for another purpose.

Some of the newer additions, like Warp10 and Akumuli, have comments about
being great choices for sensor applications and for those who need highly
performant local time series storage written in C++.

That's a good point about compression. I bundled it all into the row 'bytes
per point after compression', which gives the end result for each database,
but I haven't noted which ones actually compress. However, there is one
database that uses lossy compression, so that is noted. That's all in row 11
currently.

Someone else also raised the clustering question and I added a row for
'dynamic cluster management' to address the question of whether you could just
horizontally scale by adding a node without bringing the system down.

The performance numbers are a big problem regardless of how they are
calculated. Firstly, I'd like to address whether having them at all is
important. I believe it's incredibly important to have some kind of idea up
front before investing time trying out a database. If we agree that
performance numbers are essential, especially considering the database list
is up to 15 and growing and we need some method to narrow down what to
trial, then we're onto the next problem. Benchmarking 15 databases in a
uniform way as a science experiment is about seven master's theses' worth of
work (there are actually several on the topic of benchmarking a few of these
databases together that are interesting reads). To put it plainly, such a
benchmark won't ever happen, and if it ever does, someone will find a way to
undermine it for their use case and setup. Therefore the usefulness of the
numbers in the list is more of a practical ballpark estimate, as outlined in
the blog.

We did, however, benchmark DalmatinerDB and released easy-to-recreate
results: the exact box hardware, test method, code and a little graph. I
used the same mechanism to benchmark InfluxDB and got reasonably close to
the figures they released.

The version of each database is in row 29. It probably wasn't there when you
first saw the sheet, but I added it soon after, along with a maturity field.

Dataloop is a SaaS company of which I'm a co-founder, and we needed a
database a few years ago to build our SaaS monitoring product. None of the
things that existed at the time were all that appealing, so we picked up an
already open-sourced project (DalmatinerDB), have had about 5 people working
for a year improving it under the direction of Heinz (the original author),
and contributed it all back.

The reason to contribute it back wasn't completely altruistic. I'm
personally interested in time series databases; as a co-founder I need to go
on the road and talk at conferences, and chatting about DalmatinerDB seemed
like a good way to get my slides up, with a big 'we are hiring' one at the
end. Also, hiring Erlang devs is hard. The more community contributors we
have, the bigger the potential hiring pool of known good people.

For the ZFS question, there's the choice of SmartOS (which is what Heinz
built it on) or Linux. Ubuntu 16.04 has been out for several months and
supports native ZFS. Dataloop runs DalmatinerDB on ZFS on Ubuntu. Heinz and
several of his customers running Project FiFo (which DalmatinerDB came out
of) usually run it on SmartOS. I'm guessing far more people would be
interested in running it on Linux than on SmartOS. The database is
intrinsically linked to the filesystem: it relies on the way ZFS compression
works to achieve 1 byte per data point of storage volume, as well as on the
way the filesystem makes atomic writes. You can run DalmatinerDB without
ZFS, but you'd quickly run out of disk space and your data integrity
wouldn't be as well guaranteed.

The bit I do agree with fully is the market sizing question... it's all a
gamble based on opinion. We can probably agree it's a huge area for growth
right now, with more people moving to the cloud and spinning up far more
infrastructure than ever before. All I personally know, having been a
SysAdmin for most of my career, is that things are getting more complex and
open source isn't scaling very well, or when it does you still need a team
to manage it. If you had asked big companies 10 years ago who would want to
outsource their email hosting, I think you'd get a much different answer to
today. The long-term trend is towards SaaS, and I believe the 90% of
companies using open source today will see the same shift. Will I get the
amateurs? Probably never, but then the 90% market numbers come from our own
research:

https://blog.dataloop.io/2014/01/30/what-we-learnt-talking-to-60-companies-about-monitoring/

And then validated time and time again by others:

https://kartar.net/2015/08/monitoring-survey-2015---tools/

I honestly don't think that even if the mega combo of Prometheus /
DalmatinerDB / Grafana were polished today, it would eat into either
Dataloop's or Datadog's business. Some people want to run their own; others
want to buy SaaS. Over time we're going to see the shift to SaaS that has
happened with most other products, and by having a foot in the door with
DalmatinerDB, Dataloop should hopefully build some goodwill and credibility
with the people who want to run a monitoring stack themselves. At least
then, if they move on into an environment where they do want to buy
SaaS-hosted monitoring, we're going in warm with a good reputation.

------
Dowwie
Is there a publication date for this? That's a really important attribute to
include with comparisons like this.

~~~
dataloopio
Hi, that is a good point. It was all written yesterday, so it's pretty up to
date :) I've added some dates to the top of the blog post. Thanks!

