
Thoughts on Time-series Databases - akerl_
http://jmoiron.net/blog/thoughts-on-timeseries-databases/
======
Everlag
I actually just migrated 20 million rows of Magic: the Gathering price data
from influxDB to postgres this week. For a few days of effort, I decreased my
query latency by an order of magnitude; a full set query, roughly 270 cards,
went from 30 to 3 seconds with a cold cache.

The migration was prompted by influxDB 0.8 eating 50% of the VPS's CPU and 77%
of the RAM while idling. It had no capability to index along anything but
time, so every query, for my use case, required a full table scan. 0.9 was
supposed to fix every issue I had with it, but it was due to be 'production
ready' months ago.

Unless you're dealing with ingesting an absolutely insane amount of data
indexed along time, I'd have to say that Postgres or a comparable SQL database
should be more comfortable, more stable, and much more mature.

EDIT: I don't want to come off as shitting all over influxDB; to its credit,
it barely moved beyond idle resource usage when I was stuffing it full of data.
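The difference a proper index makes can be sketched with a toy version of such a migration. Here sqlite3 stands in for Postgres, and the table, column, and index names are all made up for illustration; the same idea applies:

```python
import sqlite3

# Hypothetical schema standing in for the Postgres one described above;
# sqlite3 is used only so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE prices (
        card  TEXT NOT NULL,
        ts    INTEGER NOT NULL,   -- unix timestamp
        price REAL NOT NULL
    )
""")
# The composite index is the point: queries filtered by card and then
# time become index range scans instead of full table scans.
conn.execute("CREATE INDEX idx_card_ts ON prices (card, ts)")

conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?)",
    [("Black Lotus", t, 10000 + t) for t in range(100)]
    + [("Lightning Bolt", t, 2 + t * 0.01) for t in range(100)],
)

# EXPLAIN QUERY PLAN shows the index being used rather than a scan
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT price FROM prices "
    "WHERE card = ? AND ts BETWEEN ? AND ?",
    ("Black Lotus", 10, 20),
).fetchall()
print(plan[0][-1])  # the plan detail should mention idx_card_ts
```

InfluxDB 0.8's inability to index on anything but time is exactly what forces the full scans described above; with an index on (card, ts), a per-card range query only touches the rows it returns.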

~~~
tsdbase
I've hit the same problem and I would like to move back to a SQL data store.
However, none of the nice dashboards / visualizations support Postgres or any
SQL database (for now)...

My question (to everyone): what do you use as replacement for kibana or
grafana?

~~~
robochat
I've just implemented a custom backend for graphite-api which seems to be
working ok although I don't have crazy requirements.
[https://github.com/brutasse/graphite-
api](https://github.com/brutasse/graphite-api) is a cleaned up fork of
graphite (which is much easier to install). I'm using grafana as the front-end
and my data is in a postgresql database and graphite-api is linking them
together.

~~~
pierluca
Hello, I find myself having the same need. Would you agree to share your
implementation or point me to it? Thank you!

------
bbrazil
Prometheus uses a file per timeseries, with two levels of delta encoding to
keep the data small. This is our second major storage iteration, and seems to
be doing pretty well with a single server able to handle over 2M timeseries.

See [http://prometheus.io/docs/introduction/faq/#why-does-
prometh...](http://prometheus.io/docs/introduction/faq/#why-does-prometheus-
use-a-custom-storage-backend-rather-than-some-other-storage-method-isn-t-the-
one-file-per-time-series-approach-killing-performance)
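The "two levels of delta encoding" can be illustrated with a toy delta-of-delta encoder. This is a sketch of the general technique, not Prometheus's actual on-disk format:

```python
def delta_of_delta(timestamps):
    """Two levels of delta encoding: store the first value, the first
    delta, then only the change in delta. Regularly spaced samples
    collapse to runs of zeros, which keeps the data small."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [d2 - d1 for d1, d2 in zip(deltas, deltas[1:])]
    return timestamps[0], deltas[0], dods

def decode(first, first_delta, dods):
    """Invert the encoding: rebuild the deltas, then the timestamps."""
    deltas = [first_delta]
    for dod in dods:
        deltas.append(deltas[-1] + dod)
    out = [first]
    for d in deltas:
        out.append(out[-1] + d)
    return out

ts = [1000, 1010, 1020, 1030, 1041, 1051]   # mostly 10s apart
first, fd, dods = delta_of_delta(ts)
print(dods)   # [0, 0, 1, -1] -- mostly zeros for regular intervals
assert decode(first, fd, dods) == ts
```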

~~~
nwmcsween
I don't understand why Prometheus uses LevelDB to begin with, though. Why do
you need it vs using a constant offset of an interval? From reading, it seems
it's used for an index; I'm guessing because intervals can be variable?

~~~
jrv
LevelDB is only used for indexes to look up series files by sets of
dimensions, not for time-based lookups. We just need to find the right time
series that are relevant for your query.

As a simple example, we have one LevelDB index which has single label=value
pairs as the keys and as the LevelDB value, the identifiers of the time series
which have those label=value dimensions. If you now query for e.g. all time
series with labels foo="biz" AND bar="baz", we will do two lookups in that
index: one for the key foo="biz", and one for bar="baz". We will now have two
sets of time series identifiers which we intersect (AND-style matching) to
arrive at the set of time series you're interested in querying. Only then do
we actually start loading any actual time series data (not from LevelDB this
time).
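A toy version of that lookup, with plain dicts and sets standing in for LevelDB (the labels and series IDs below are invented for illustration):

```python
# Key: a single label=value pair; value: the set of IDs of the time
# series carrying that pair -- a simple inverted index.
label_index = {
    ("foo", "biz"): {1, 2, 5},
    ("bar", "baz"): {2, 3, 5},
    ("bar", "qux"): {1, 4},
}

def series_matching(*pairs):
    """AND-style matching: one index lookup per label pair, then
    intersect the resulting sets of series identifiers."""
    sets = [label_index.get(p, set()) for p in pairs]
    return set.intersection(*sets) if sets else set()

# foo="biz" AND bar="baz" -> series 2 and 5; only at this point would
# the actual sample data for those series be loaded from disk.
print(series_matching(("foo", "biz"), ("bar", "baz")))  # {2, 5}
```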

------
ignoramous
Related:

Baron Schwartz article on Time-series Database Requirements is quite a read:
[http://www.xaprb.com/blog/2014/06/08/time-series-database-
re...](http://www.xaprb.com/blog/2014/06/08/time-series-database-
requirements/)

HN discussion:
[https://news.ycombinator.com/item?id=9166495](https://news.ycombinator.com/item?id=9166495)

~~~
rch
I'm not sure it's universally correct to dump authorizations and visibility
from the up-front requirements. It might be regarded as another aspect of
dimensionality, but that could lead to missed opportunities for optimization.

------
engi_nerd
I'm always amused when I see the criterion for a "very dense" time series as
data being collected more than once per second. In my business (telemetry), we
often record parameters thousands of times per second, depending on what we
are trying to measure.

~~~
elektronaut
I've got a system here capable of collecting data close to a hundred thousand
times per second from tens of sources, with perceptual real time (< 10ms)
processing and aggregate monitoring.

It's called multitrack audio recording, and commonly runs on your run of the
mill laptop. HD video would probably be a few orders of magnitude more data
than that, with even more processing.

Computers are really, really good at these kinds of things if only the
software is efficient enough.

~~~
engi_nerd
Your last sentence is well taken, but the rest leaves me a bit puzzled about
what you meant.

Your run of the mill laptop is a wonderful machine capable of amazing things.
But that laptop is not rated to withstand extreme vibration and shock and huge
temperature ranges. Nor does it have the external interfaces to deal with
large numbers of custom, high speed digital data buses, nor does it contain
the signal conditioning hardware to deal with a huge array of transducers
working with differing physical phenomena, each with its own unique power and
signal processing requirements. The system I support has to deal with all of
these things, plus more constraints than are relevant to the conversation
here.

Much (but not all!) of this doesn't even require a general purpose computer.
It can be done with state machines implemented in FPGAs.

Sometimes you need something different than a run of the mill laptop.

~~~
elektronaut
I apologize if it came across as a snarky comment or one-upmanship, I merely
intended to support your original comment.

~~~
engi_nerd
No problem.

Like I said, you do have a very good point. I am often frustrated by the
realization that my laptop is much more powerful than this specialized data
collection equipment. But we have a very large fixed base of equipment, with
no plans to upgrade anytime soon.

------
jnordwick
Really? The big boy in the field KDB+ isn't mentioned? Kx's database is pretty
much the gold standard for performance in time series, historical and real-
time.

[http://kxcommunity.com/](http://kxcommunity.com/)

~~~
kvcc01
I use kdb/Q at work and it’s a fun tool to play with so long as someone else
is paying for it. It is quite common in finance (and comparably uncommon
outside of it). It’s very expensive of course, and the learning curve is hard.
In fact, there are plenty of businesses that have sprung up around kdb that
offer consultancy services to help you get started. In an unusual maneuver,
one of these consulting businesses actually ended up buying a majority stake
in Kx Systems, the developer of kdb/Q. Anyway, if you know your Q and C++, you will
always have a job in finance.

Part of the reason why it’s hard to learn (unless your job depends on it so
you are forced to persist) is that the syntax is very terse. Check out this
Java API for example:
[http://kx.com/q/c/kx/c.java](http://kx.com/q/c/kx/c.java). Yes, that’s the
actual code you copy-paste into your Eclipse to get started.

~~~
jboggan
32-bit version is free to play with nowadays: [http://kx.com/software-
download.php](http://kx.com/software-download.php)

------
rdtsc
> Aggregation and expiry start to look a lot like dimensionality: they can be
> implemented asynchronously in a separate policy layer. It doesn't seem
> important that the actual storage engine is mindful of these; in fact, it's
> probably better ignored for efficiency's sake.

I wrote my own time series database, and getting expiry to work well was
actually harder than expected. I sharded each one of the files into 2GB or
24h chunks (whichever came first), then had to be careful about how they got
deleted. There were various rules, such as "make sure not to fill the
partition more than 80%", "stop when the partition is full", or "keep data no
more than 90 days".

My number of different time series was actually pretty low but each value was
a record with complex data inside it i.e. not just simple ints or floats. For
example, one series was capturing network packets so it looked like [
(timestamp,packet), (timestamp, packet), ... ] . But then an indexing service
was running separately in a separate OS process to generate additional complex
indexing from the primary data.
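A sketch of the chunk rotation and expiry rules described above. The function names, structure, and small example numbers are illustrative, not taken from the actual system:

```python
CHUNK_BYTES = 2 * 1024**3    # rotate a chunk at 2 GB...
CHUNK_SECONDS = 24 * 3600    # ...or after 24h, whichever comes first

def needs_rotation(chunk_size, chunk_age_s):
    return chunk_size >= CHUNK_BYTES or chunk_age_s >= CHUNK_SECONDS

def chunks_to_delete(chunks, now, disk_capacity, disk_used,
                     max_age_days=90, high_water=0.80):
    """Oldest-first expiry honoring the rules above: keep data no more
    than 90 days, and free space once the partition passes 80% full.
    `chunks` is [(start_ts, size_bytes), ...] sorted oldest first."""
    doomed, used = [], disk_used
    for start_ts, size in chunks:
        too_old = (now - start_ts) > max_age_days * 86400
        if too_old or used / disk_capacity > high_water:
            doomed.append(start_ts)
            used -= size
        else:
            break   # chunks are ordered, so nothing newer expires
    return doomed

now = 100 * 86400
chunks = [(now - 95 * 86400, 10), (now - 10 * 86400, 10)]
# Only the 95-day-old chunk is expired; disk is just half full.
print(chunks_to_delete(chunks, now, disk_capacity=100, disk_used=50))
```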

~~~
MichaelGG
What industry or app was this for? I wrote something that sounds like this for
VoIP packet capture. Though the rates were around 5TB/day. I'd keep the latest
indexes in RAM before delta-encoding and writing out. Periodically, merge
multiple indexes into one.

Worked surprisingly well, and the indexing overhead was only a few bits per
packet: given enough similarity in packets over a short chunk of time (say, 1
minute), just indexing unique values per chunk was all fairly efficient.

~~~
rdtsc
VoIP capture was one of the functions in the system too. But it had others. It
looks like your implementation is more sophisticated, I didn't do any delta-
encoding. Just periodic fsyncs. The maximum we tested was up to 1TB/day.

The main index, [(timestamp, offset_in_data)], was simply synced every second.
It just looked like [uint64|uint32|...]. Then there is an advanced service
with different plugins; it runs in a separate process and trails the main data
stream. Each plugin can then write out its own custom index data to a file and
also saves its internal state (so it can recover after a hard crash).

~~~
MichaelGG
Eh well mine was mostly a hack to index SIP packets. Other existing products
just shoved every packet into a MySQL table. One column per field, with
indexes. You can imagine that does not scale very well. So I ended up doing
similar to what you describe.

Fun how designs converge. You'd think there'd be some generic libraries for
this kind of stuff. Now there's LevelDB and its SSTable is sort of close but
uses Snappy so it can't operate on compressed data.

What's the common library that implements fast, compressed, immutable key
values without lots of per record overhead?

------
lnkmails
[https://github.com/rackerlabs/blueflood](https://github.com/rackerlabs/blueflood)
is a time series datastore with aggregation and rest APIs built on top of
Cassandra. It's production ready and in fact used in production at Rackspace.
Full disclaimer: I am a core contributor.

------
deathflute
Well, all you need to do is to look at KDB+ from kx.

All the other products that you mention are for children ;)

~~~
jnordwick
I know the Kx people pretty well, and they are trying to get the word out
(and have been for years), but it never ceases to amaze me how little respect
they get in the free software world. They are the leading timeseries database,
and yet they don't even get a footnote in the article :(

~~~
jimmcslim
Possibly because they cost an arm and a leg (or at least that's the
perception) and are therefore out of reach of most firms, apart from large
utilities and hedge funds, and the language looks like line noise.

Yes, I know there is a free version, but limited to 32-bit only (and probably
non-commercial?).

EDIT: 32-bit version can be used commercially.

~~~
deathflute
Agree about the cost. However, I would think that a wider adoption would
eventually bring the cost down and perhaps even spawn a bunch of related open-
source projects.

As far as readability is concerned, q(KDB+) is far more readable than k(KDB).
Also, nobody stops you from adopting a coding style that is more readable.
That is what I personally do.

------
olidb2
OP works on the time-series processing backend at Datadog
[http://datadog.com](http://datadog.com)

------
necubi
I've been pretty happy with OpenTSDB (a TSD built on top of HBase). It happily
ingests tens of thousands of points per second, supports full-resolution
historical data, and has reasonably fast queries.

The main downside is HBase is operationally complex, but if you've already
made the investment there (as we had), it's a great option.

------
sciurus
If you are looking for time-series databases based on Cassandra that you can
use with Graphite, check out

Cyanite: [https://github.com/pyr/cyanite](https://github.com/pyr/cyanite) and
[https://github.com/brutasse/graphite-
cyanite](https://github.com/brutasse/graphite-cyanite)

KairosDB:
[https://github.com/kairosdb/kairosdb](https://github.com/kairosdb/kairosdb)
and [https://github.com/kairosdb/kairos-
carbon](https://github.com/kairosdb/kairos-carbon) and
[https://github.com/Lastik/KairosdbGraphiteFinder](https://github.com/Lastik/KairosdbGraphiteFinder)

~~~
johngd
Have you used cyanite in any meaningful way? The original author's (pyr) repo
has been pretty dead.

This person has been doing a lot of good work:
[https://github.com/mwmanley/cyanite](https://github.com/mwmanley/cyanite)

~~~
sciurus
I've considered it but haven't made the jump yet. Pyr gave a presentation a
few weeks ago that suggested his company is already using it and further
development is coming.

[https://vimeo.com/131581325](https://vimeo.com/131581325)

------
nwmcsween
Why does every TSD seem so over-engineered, and for all the wrong reasons?
Why not just use a time-decaying ring buffer (multiple buffers could be used),
one statistic per file (or more depending on the decay), offset by a set
interval; if you have irregular intervals, 'smooth' them to fit. That's O(1)
for most things. My other issue is that (from a glance) some TSDs seem to
ignore most research done on how to effectively store the data.

Too many writes, use a cache to batch. No metadata? put the metadata in a
different file and point to the offsets.
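A minimal sketch of that fixed-interval idea, assuming regular slots and nearest-slot "smoothing" for irregular timestamps (class and parameter names are made up):

```python
class RingSeries:
    """Fixed-interval ring buffer: one slot per interval, so both
    reads and writes are O(1) index arithmetic. Irregular timestamps
    are 'smoothed' onto the grid by rounding to the nearest slot.
    A sketch, not a production store."""

    def __init__(self, interval_s, slots):
        self.interval = interval_s
        self.slots = slots
        self.buf = [None] * slots   # old data is overwritten in place

    def _idx(self, ts):
        return round(ts / self.interval) % self.slots

    def write(self, ts, value):
        self.buf[self._idx(ts)] = value

    def read(self, ts):
        return self.buf[self._idx(ts)]

rb = RingSeries(interval_s=10, slots=8640)   # one day at 10s resolution
rb.write(1003, 0.5)    # irregular timestamp lands in the t=1000 slot
print(rb.read(1000))   # 0.5
```

Decaying into a coarser buffer, as suggested in the reply below the original comment, would just mean aggregating slots into a second `RingSeries` with a larger interval.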

~~~
zaphar

        Why not just use a time decaying ring buffer 
    

Because you don't want to throw the data away. RRDTool and Whisper implement
ring buffers, but you lose data and resolution with them. If that's acceptable
then absolutely use those tools. If you don't want to lose data, though, then
you need something else.

~~~
nwmcsween
You don't have to decay, I just assumed eventually you would want to decay the
data but in a separate ring buffer, thus no resolution loss.

~~~
bigger_cheese
This is how things work at the industrial plant where I work. We dump a heap
of data out of PLCs at basically whatever native frequency the instrument can
log it at. It hits the first datastore (which is essentially a flushed ring
buffer), where it might get thrown at an HMI display or something like that.
After that it decays into the slower historical archive, where it can get
some metadata added to it and get rationalized, aggregated, batched together,
or whatever.

It varies, but datastores tend to be 1 hour worth of data, then 3 days, then
3 months, then permanent. As you move between datastores, latency to access a
timestamp increases.

We tend to call Time Series databases "Data Historians" its a big industry and
from what I can tell most commercial products are built around ring buffers.

------
jmoiron
A few comments on some things I'm seeing in these.. comments.

\- What about Cassandra/PostgreSQL/Redis?

One of the implications of "99% of data is never read" is that it's incredibly
wasteful to keep it all in memory. You might be assuming that you are letting
the data expire eventually, but I'm actually not; expiry is a secondary
concern to storage.

Once you start to involve the disk (PostgreSQL and Cassandra), you start to
get into locality issues that these databases weren't really designed for.

For a more concrete description, let's say I have 2000 machines. Our app is
Python, so they run 16 dockerized processes each, with each container
reporting 10 simple system metrics every 10 seconds. These metrics are locally
aggregated (like statsd) into separate mean, max, min, median, and 90pctile
series. I've not instrumented my app yet and that's already 160k writes/sec on
average; if our containers all thunder at us it's 1.6m, and frankly this is
the "simple" case as:

* we've made some concessions on the resolution

* we only have dense series

Anyone who has used graphite at scale knows that this is a really painful
scenario, but these numbers are not particularly big; anywhere you could take
an order of magnitude there are a few other places you can add one.
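The arithmetic behind those numbers, spelled out:

```python
# Back-of-the-envelope check of the write rate described above
machines = 2000
containers = 16      # dockerized processes per machine
metrics = 10         # simple system metrics per container
aggregates = 5       # mean, max, min, median, 90th percentile
interval_s = 10      # reporting interval

series = machines * containers * metrics * aggregates
writes_per_sec = series // interval_s
print(series)          # 1600000 distinct timeseries
print(writes_per_sec)  # 160000 writes/sec on average
# If every container reports in the same second (a thundering herd),
# that's the full 1.6M writes landing at once.
```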

I'm also assuming we are materializing each of these points into their own
timeseries, but that's more or less a necessity. It gets back to the locality
issues; If we wanted to see the "top 10 system load per node", it's actually
quite imperative that we aren't wasting IO cycles on series that _aren't_
system load; we need all we've got to handle the reads.

(As a side point, this is why people in the cloud are adopting so much Go so
quickly; it's proven to be easy to write AND read, and also to reduce orders
of magnitude in several of the dimensions above, eg. "we can use 1 process per
box not 16 containers, and we can get by with 1000 machines not 2000." Having
to write your own linked list or whatever doesn't register in the calculus.)

\- 1s resolution isn't dense:

No, not always. It's hard to please everyone with these things. In my world,
1s is good (but not great), but 10s seems to more or less be accepted
universally, and much sparser (5m, 30m, 6h) is not actually uncommon. At the
other end of the spectrum, you can be instrumenting at huge cardinality but
very sparsely (think per-user or per-session), perhaps only a few points per
timeseries, and the whole of what I've described above kinda gets flipped on
its head and a new reality emerges. For what I've described, I quite like the
Prometheus approach, but for my very specific use case 1-file-per-metric only
beats the filesize block overhead for very long timeframes; too long.

\- Why are all TSDB over-engineered?

I hope some of the above has made explicit some of the difficulties in
collecting and actually making this data readable. I've only thus far
discussed the problem of "associate this {timestamp,value} pair with the
correct series in storage"; there are also the following problems:

* you can't query 3 months of 1s resolution data in a reasonable amount of time, so you need to do rollups, but the aggregations we want aren't all associative so you have to do a bunch of them or else you lose accuracy in a huge way (eg, if you do an avg rollup across min data, you flatten out your mins.. which you don't want); this means adding ANOTHER few dimensions to your storage system (time interval, aggregator)

* eventually you have to expire some of this junk, or move it to cold storage; this is a process, and processes require testing, vigilance, monitoring, development, etc.

* you need an actual query system that takes something useful and readable by a human ("AVG system.cpu.usage WHERE env=production AND role=db-master") and determines which series actually fall into those categories for the time interval you're querying. Any holistic system that _doesn't_ do this is an evolutionary dead end; eventually, something like Prometheus or Influx will replace them.
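The avg-rollup-over-min pitfall from the first bullet, in concrete terms:

```python
# Averaging a min series erases the mins you wanted to keep.
raw = [5.0, 5.0, 0.1, 5.0, 5.0, 5.0, 5.0, 0.2]   # min data with two dips

# Correct: roll up min data with a min aggregator (4 points -> 1)
min_rollup = [min(raw[i:i + 4]) for i in range(0, len(raw), 4)]
print(min_rollup)   # [0.1, 0.2] -- the dips survive

# Wrong: roll up the same data with avg
avg_rollup = [sum(raw[i:i + 4]) / 4 for i in range(0, len(raw), 4)]
print(avg_rollup)   # ~[3.775, 3.8] -- the dips are flattened away
```

This is why each aggregator needs its own rollup, which is the extra storage dimension the bullet is describing.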

These are _minimum_ requirements once you "solve" storage, which is always a
very tricky thing to have claimed. If you get here, you've reached what decent
SaaS did 4 years ago and what very expensive proprietary systems handled 10
years ago.

\- What about Prometheus/InfluxDB/Kdb+/Et al.

Kdb+ is very expensive, its open source documentation is difficult, and its
source is unintelligible. It is basically from a different planet than the one
I'm from. Even recently, when I encounter people from, say, the C# world and tell
them I work with Python and Go, they ignore Go and say "Wow, there are like no
jobs for Python", which I find utterly bewildering. Of course, I never
encounter any jobs using C#, either. This is how little some of these spheres
overlap sometimes. Someone from the finance world is going to have to come in
and reproduce the genius of K and Q for us mortals in a language we
understand.

As for Prometheus and InfluxDB, I follow these more closely and have a better
understanding of how they operate. I think that they are both doing really
valuable work in this space.

From a storage aspect, I think the Prometheus approach is _closer_ to the one
that I need for my particular challenges than the InfluxDB one is, and in fact
it looks a bit like things we've already had (but also Catena, Parquet, et
al..) For most people, storage actually isn't important so long as it's fast
enough.

And this is kind of the point of my article. There's starting to be a bit of a
convergence among a few open source TSDBs, and I've tried to
highlight some issues in those approaches and suggest that there's room for
improvement. I have my own ideas about what these improvements might look like
based on my work at Datadog, and once they're proven (or even disproven)
they'll be published somehow.

------
olviko
What are the benefits of introducing specialized time-series databases vs
using Redis, Cassandra, or some SQL database?

~~~
walkingolof
Very few, if any, if you're talking Cassandra (I would never store time series
in Redis). I worked on developing a TSD and am active in the space; most
companies go with Cassandra these days and build computational frameworks on
top of it.

~~~
olviko
Just curious, why not Redis?

~~~
fweespeech
I'm not him obviously but...

If you are at the scale where you can't dump it on just 1-2 nodes and call it
a day [which is when you start looking at Cassandra or a dedicated TSD], you
usually really need 3-DC availability, and Redis simply cannot do that in any
reasonable way.

~~~
nemothekid
Even if you don't need a dedicated TSD with redundancy, Redis would still be
more expensive to run given that you would need to keep everything in memory.
Given that you won't read 90% of the data most of the time, it makes little
sense to store all of it in memory.

------
ap22213
I'm surprised that PI[0] isn't mentioned in this thread. It's been around
since the 80s and it's ubiquitous in the utility industry for recording SCADA
data.

[0]:
[https://en.wikipedia.org/wiki/OSIsoft](https://en.wikipedia.org/wiki/OSIsoft)

------
andl
A list of various Time-series Databases together with their estimated
popularity can be found at [http://db-
engines.com/en/ranking/time+series+dbms](http://db-
engines.com/en/ranking/time+series+dbms)

~~~
augustl
I wonder why Datomic isn't mentioned.

------
gane5h
We store our event stream data in Elasticsearch. Two features that made it
appealing:

    
    
* the ingest-side can be scaled up by adding more shards

* the query-side can be scaled up by adding more replicas
    

To compute rollup analytics, we make heavy use of Elasticsearch's aggregation
framework to compute daily/weekly/monthly/quarterly active users.

From my understanding Postgres has many of these features, but the distributed
features of ES are killer!

~~~
lobster_johnson
We're using ElasticSearch for events, too. The aggregation operators are
really surprisingly fast.

That said, one major downside to ES is that it's not schemaless. You can try
to use the dynamic mapping system, but it will most likely just bite you
eventually, since ES is strict about coercing data types. If your data isn't
completely consistent, it will actually refuse to index it. Any changes made
to your schema also requires reindexing. (For some reason ES can't do in-place
indexing, despite supporting storing all the original data in the "_source"
field.)

If your data isn't perfectly consistent, one way to work around the mapping
problem is to append a type name to every field. So instead of indexing
{"user_id": "3"}, you index {"user_id.string": "3"}. This means that if you
get some input data where the user_id is an int, it doesn't conflict because
it will be stored in "user_id.int". You have to handle the inconsistency on the
query end, but it's possibly better than micromanaging the index.
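That workaround can be sketched in a few lines; it's a toy pre-processing step you'd run before indexing, not an official ES feature:

```python
def suffix_types(doc):
    """Tag each field name with the runtime type of its value, so
    inconsistent inputs land in distinct, non-conflicting ES fields."""
    type_names = {str: "string", int: "int", float: "float", bool: "bool"}
    return {
        f"{key}.{type_names.get(type(value), 'object')}": value
        for key, value in doc.items()
    }

print(suffix_types({"user_id": "3"}))   # {'user_id.string': '3'}
print(suffix_types({"user_id": 3}))     # {'user_id.int': 3}
```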

------
emmanueloga_
I thought someone was going to mention [http://druid.io/](http://druid.io/). I
ended up not using it for anything, but at some point I was investigating time
series DBs and I thought it was interesting.

If I'm not wrong it combines both very fast in-memory OLAP features and what
they call "Deep Storage", which I think is a way to store things on disk for
slower historical analysis.

------
gtrubetskoy
The subject of Time Series has lately been on my mind as well, see my blog
posts on accuracy of Graphite vs RRD, as well as InfluxDB storage:
[http://grisha.org/](http://grisha.org/)

I am leaning towards none of the above being the best solution and am in the
process of writing my own (too early to announce yet).

~~~
sciurus
Like I mentioned here [0], whisper isn't the only storage option for graphite.
Another user [1] mentioned blueflood. Have you evaluated any of these
cassandra-based options?

[0]
[https://news.ycombinator.com/item?id=9808035](https://news.ycombinator.com/item?id=9808035)

[1]
[https://news.ycombinator.com/item?id=9808662](https://news.ycombinator.com/item?id=9808662)

~~~
jsmthrowaway
At Foursquare, work was being done to put Hypertable under Graphite. It was
incredible to use (my year+ queries returned in tens of milliseconds), but I
don't know what came of it. Hypertable is criminally overlooked in the
industry, and TSDB is a killer app; a little bit of glue code and you've
basically invented a crude clone of Google's monitoring, with that stack.

------
jimmcslim
It seems that the vast majority of these open-source TSDBs are focused on
fairly technical event stream data arising specifically from IT
infrastructure... is anyone using them with success in more business-oriented
domains, e.g. energy meter data, stock trading, other telemetry?

~~~
lobster_johnson
We were using InfluxDB for web analytics (generated by user interactions such
as "viewed page", "viewed product", "refined search", etc.).

0.8 was okay, but very slow and unstable. But in 0.9 the data model is very
different and no longer a good fit for this type of analytics (it looks great
for devops metrics, though), so we're abandoning InfluxDB altogether.

We're now in the process of migrating to ElasticSearch, which is looking much
better. ElasticSearch has its own problems, though, and we will be evaluating
the same dataset in PostgreSQL soon.

------
shakil
Handling time-series data with Google BigQuery
[https://cloud.google.com/solutions/time-series/bigquery-
fina...](https://cloud.google.com/solutions/time-series/bigquery-financial-
forex)

------
rodionos
I wonder if anyone would be interested to try a TSDB as a hosted service, let's
say running on top of Google Cloud Bigtable, once it moves out of beta?

------
LGBT_2000
What's wrong with just using plain SQL tables? How big of a dataset are we
talking about supporting here?

------
edwinnathaniel
Anyone using SAP HANA for Series Data?

------
KyleBrandt
At Stack Exchange our monitoring system bosun
([http://bosun.org](http://bosun.org)) can use different time series databases
as long as they can be bent into tag key+tag value models. Currently it works
best with OpenTSDB, but can also support graphite (and elasticsearch populated
by logstash). InfluxDB query support is in a branch, but don't want to merge
until we have a devoted Bosun+InfluxDB maintainer since we don't use it at
Stack currently.

Based on that experience, plus conversations at Monitorama the other week,
here is what I think the current state of various TSDBs is. Some of this
might just be lies or rumor - so take it as that:

* OpenTSDB: Requires HBase behind it, so that can be a pain for people. Maintenance on it is sparse; the project seems short of contributors with the time needed. Stability isn't great (connection errors from time to time; having alerts based on querying OpenTSDB highlights this). Aggregation and downsampling don't behave as expected. For example, rate derivatives happen too late in the order of operations, and linear interpolation can be strange. Also, querying a metric with many tag combinations over more than a recent interval of time (say a month or more) is basically impossible - OpenTSDB memory blows up and GC dominates. This requires one to create additional metrics that are denormalized for this purpose. This is kind of okay because OpenTSDB is incredibly storage efficient at storing time series data. No support for NaN. OpenTSDB has quite a few serious users [https://github.com/OpenTSDB/opentsdb/wiki/Companies-using-Op...](https://github.com/OpenTSDB/opentsdb/wiki/Companies-using-OpenTSDB-in-production). It can ingest a lot of metrics at a high rate without issue.

* KairosDB: Not much experience here. From what I gather it is like OpenTSDB but for Cassandra. Someone mentioned that they thought they heard some core devs have gone to work at Influx, which might be concerning - but I don't know if that is true. But it has the same issue of having to run Cassandra if you don't already.

* Graphite: Very rich query language, but currently not a key/value model. It's also not very storage efficient, so the approach is that data gets rolled up after a certain period of time, which is generally problematic for forecasting.

* InfluxDB: Looks promising, but at Monitorama I heard from multiple people "tried InfluxDB - was cool, but all my data corrupted and I couldn't recover it". The general concern there was that they are overestimating their stability when it comes to a production environment. Based on some basic testing at Stack, we found it to be much slower and to take up a lot more space than OpenTSDB.

In summary there is no great choice today. More of a pick your pain and best
fit situation. But I'm really curious what people with actual experience in
these technologies can add to the tradeoffs and am hopeful for the future.

~~~
pauldix
InfluxDB CEO here. Those problems with corrupting data were with the 0.8 line
of releases. But to be honest there are people that have been running that and
0.7 in production for almost a year without problems. Your mileage may vary,
but we're not supporting any releases prior to the 0.9 line.

For the 0.9 set of releases, this is what we're supporting going forward.
There are some queries that cause the server to crash, but as far as I know,
there are no problems that corrupt the database or cause data loss.

We'll be releasing 0.9.1 tomorrow. Every 3 weeks after that we'll be releasing
a new point release in the 0.9 line that will be a drop-in replacement.

Each one of these releases will fix bugs, improve performance, and add
features on clustering (last part starting with 0.9.2).

We're starting work on the on disk size with the 0.9.2 release cycle. If it's
ready it'll be in that release in 3 weeks.

Basically, it works now for some use cases and scales. Over the next 3 months
we'll be adding features and optimizing to make it useful for larger scales
and more use cases.

Overall it's still alpha software, which is why we haven't put anything out
there that's called a 1.0 release. However, we're trying very hard to not make
any breaking API changes going forward between now and whenever we get to 1.0.

~~~
mjibson
I performed some OpenTSDB vs InfluxDB comparisons and found that InfluxDB used
almost 20x the storage space and was 3x slower than OpenTSDB for an identical
data set. The speed isn't that big of a problem and I'm convinced will get
faster (esp with the write improvements in 0.9.1), but the space issue is
harder to swallow.

HBase, which is what backs OpenTSDB, can compress data using LZO or snappy.
Uncompressed, our test data was at 4GB, but went down to about 400MB after
HBase compressed it. InfluxDB was using 8GB. OpenTSDB has done a lot of work
to be byte efficient, and it's paid off. We hope InfluxDB will get to a
similar place.

~~~
pauldix
Yep, compression work is starting in the 0.9.2 release cycle. We'll be testing
out those compression methods, along with other stuff like delta encoding.

