
InfluxDB vs. Cassandra for timeseries data - rar_ram
https://www.influxdata.com/influxdb-vs-cassandra-benchmark-time-series-metrics/
======
paulasmuth
The linked article is an obviously bullshit benchmark that makes influxdb look
good and cassandra look bad (by, surprise, the influxdb folks).

I'm far from a cassandra fanboy, but this really is just dishonest marketing.
Not sure if that will work if your product is open source and the target
audience is developers.

Some thoughts:

\- The reason why cassandra uses so much more space to store the same data is
that they've set up the cassandra table schema in such a way that cassandra
needs to write the series ID string for each sample (while influxdb only needs
to write the values). You easily get a 10-100x blowup just from that. There is
no superior "compression" technology here but just an apples-to-oranges
comparison.

\- Then, comparing the queries is even worse, because they are testing a kind
of query (aggregation) that cassandra does not support. To still get a
benchmark where they're much faster, they just wrote some code that retrieves
all the data from cassandra into a process and then executes the query within
their own process. If anything, they're benchmarking one query tool they've
written against another one of their own tools.

\- Also, if I didn't miss anything, the article doesn't say on what kind of
cluster they actually ran this on or even if they ran both tests on the same
hardware. There definitely are cassandra clusters handling more than 100k
writes/sec in production right now. So I guess they picked a peculiar
configuration in which they outperform cassandra in terms of write ops (given
a good distribution of keys, cassandra is more or less linearly scalable in
this dimension)

\- A better target to benchmark against would probably be
[http://opentsdb.net/](http://opentsdb.net/) or
[http://prometheus.io/](http://prometheus.io/) \- both seem to have somewhat
similar semantics to InfluxDB (which cassandra and elasticsearch do not)
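
To put a rough number on the first point, here is a toy calculation of raw bytes per sample when the series ID string is repeated on every row versus stored once per series. This is only a sketch: the series ID below is made up (it is not taken from the benchmark schema), and the model ignores timestamps, row overhead and compression in both systems.

```python
def bytes_per_sample(value_bytes: int, series_id: str, repeat_id: bool) -> int:
    """Raw, uncompressed bytes per sample under a toy storage model."""
    # If the schema repeats the series ID on every row, each sample pays for it.
    id_bytes = len(series_id.encode("utf-8")) if repeat_id else 0
    return value_bytes + id_bytes


# Hypothetical series ID, for illustration only.
series_id = "cpu,hostname=host_0,region=us-west-1,datacenter=us-west-1a,usage_user"
value_bytes = 8  # one float64 value per sample

with_id = bytes_per_sample(value_bytes, series_id, repeat_id=True)
without_id = bytes_per_sample(value_bytes, series_id, repeat_id=False)
blowup = with_id / without_id

print(f"with ID: {with_id} B, values only: {without_id} B, blowup: {blowup:.1f}x")
```

The longer the series ID relative to the value, the larger the factor, which is why the schema choice alone can dominate a disk usage comparison.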

DISC: I also work on a distributed database product
([https://eventql.io](https://eventql.io)) but it's neither a direct
competitor to Cassandra nor InfluxDB nor any of the other products I've
mentioned. I hope the comment doesn't come across as too harsh. The article
makes some very big (and harsh) claims, so I think it's fair to respond in
kind.

~~~
brianwawok
I don't understand this benchmark at all. It says performance of a 1000 node
cluster, but then shows 100k inserts per second in Cassandra. Then later
follow up comments say that this test was on a single machine. Without seeing
the schema, 100k inserts / sec is reasonable for a single machine. For 1000
machines it would mean there is a pretty massive configuration issue.

If you are going to benchmark a distributed system, you really need to set up
more than 1 server.

(Disclaimer - work at Datastax)

~~~
paulasmuth
This confused me, too.

I think what they meant by "1000 nodes" is that the dataset they're using
for the benchmark is synthetic monitoring data (where the things being
monitored are servers).

And the way they generated the synthetic data set is by having 1000
imaginary servers produce one sample per second (i.e. have a script that
writes out 1000 * duration_in_sec fake samples -- I believe this is the code
that does it
[https://github.com/influxdata/influxdb-comparisons/tree/mast...](https://github.com/influxdata/influxdb-comparisons/tree/master/bulk_data_gen))
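
As a rough illustration of that shape of data (this is not the linked generator, just a minimal sketch with made-up host names and random values), the script boils down to something like:

```python
import random
from typing import Iterator, Tuple


def generate_samples(
    num_hosts: int, duration_sec: int, seed: int = 0
) -> Iterator[Tuple[str, int, float]]:
    """Yield (host, timestamp_sec, value): one sample per host per second."""
    rng = random.Random(seed)  # seeded so repeated runs produce the same data
    for t in range(duration_sec):
        for h in range(num_hosts):
            # Hypothetical metric: a fake CPU usage percentage.
            yield (f"host_{h}", t, rng.uniform(0.0, 100.0))


samples = list(generate_samples(num_hosts=1000, duration_sec=10))
print(len(samples))  # num_hosts * duration_sec samples in total
```

So "1000 nodes" describes the number of simulated data sources, not the size of the database cluster.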

~~~
brianwawok
Makes sense.

Posting 1 node benchmarks of distributed databases seems suboptimal.

------
daenney
The conclusion isn't entirely surprising, "we from X say that engine X is
better than engine Y" but there are many companies that have monitoring stacks
built on top of Cassandra, like SignalFX. They have a presentation or two on
the topic too that might be interesting:
[http://www.slideshare.net/planetcassandra/signalfx-making-ca...](http://www.slideshare.net/planetcassandra/signalfx-making-cassandra-perform-as-a-time-series-database)

Ultimately this benchmark will be heavily influenced by the code written to
"emulate" the InfluxDB parts on top of Cassandra and how much of that code
puts Cassandra at a disadvantage. I'd like to hear from some people that have
built such solutions on top of Cassandra what they think about the benchmark
and see how that benchmark would evolve.

------
soundoflight
From using InfluxDB (up to v0.10 I think it was), it's a great database but
performance REALLY depends on the cardinality of your data.

I can't stress this enough: calculate your cardinality before switching over to
it. If your cardinality looks good, InfluxDB is a perfect, logical choice. I
really enjoyed it and it is dirt simple to figure out. We had a junior dev
just out of college with little experience set it up and get a high level of
proficiency in a matter of hours.

Edit: I should point out, I was handling about 10 million records a day on my db
(hosted on a Mac Mini in development!) with a 2 week sliding window. I
was pushing the data from InfluxDB into custom D3 visualizations. I would
cache certain queries in Redis, so I wasn't always hitting InfluxDB with each
read request.
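
The caching part looked roughly like the sketch below. To keep it self-contained, a plain in-memory dict with a time-to-live stands in for Redis, and the query string and `run_query` function are hypothetical placeholders rather than my real code.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Tiny in-memory stand-in for a Redis cache with per-key expiry."""

    def __init__(self, ttl_sec: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_sec = ttl_sec
        self.clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl_sec:
            return hit[1]  # still fresh: skip the database
        value = compute()  # missing or expired: run the real query
        self._store[key] = (now, value)
        return value


calls = []


def run_query():
    # Placeholder for the actual database round trip.
    calls.append(1)
    return [("2016-09-08T00:00:00Z", 42.0)]


cache = TTLCache(ttl_sec=60.0)
query = "SELECT mean(value) FROM connections WHERE time > now() - 1h GROUP BY time(1m)"
first = cache.get_or_compute(query, run_query)
second = cache.get_or_compute(query, run_query)  # served from the cache
print(len(calls))
```

Keying the cache on the query text works when dashboards keep issuing the same handful of queries.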

~~~
pauldix
We're working on the cardinality problem. Will be resolved in an upcoming
release. Moving the index over to a disk based format that will hopefully
still be fast and not sacrifice lookup performance.

~~~
bsg75
Can you explain the cardinality problem in a bit more detail? It's come up more
than once in this thread.

~~~
soundoflight
[https://docs.influxdata.com/influxdb/v1.0/concepts/glossary/...](https://docs.influxdata.com/influxdb/v1.0/concepts/glossary/#series-cardinality)

You want to keep the number of distinct values that you are indexing/tagging on
low. As an example from my situation, I was tracking what amounted to
connections between nodes in a very large tree. I had a lot of distinct pairs,
which means that I had a high cardinality. When the cardinality increases, a
query that used to take a millisecond could start taking a couple of seconds.
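
A quick way to estimate this before switching, assuming series cardinality is roughly the number of distinct measurement plus tag set combinations (the measurement and tag names here are made up to resemble my case):

```python
from typing import Dict, Iterable, Tuple


def series_cardinality(points: Iterable[Tuple[str, Dict[str, str]]]) -> int:
    """Count distinct (measurement, tag set) combinations."""
    seen = set()
    for measurement, tags in points:
        # Sort the tags so the same tag set always produces the same key.
        seen.add((measurement, tuple(sorted(tags.items()))))
    return len(seen)


# Hypothetical data: one series per ordered pair of distinct nodes.
nodes = [f"node_{i}" for i in range(100)]
points = [
    ("connection", {"src": a, "dst": b})
    for a in nodes
    for b in nodes
    if a != b
]

print(series_cardinality(points))  # 100 * 99 ordered pairs
```

Tagging on pairs grows quadratically with the number of nodes, so even a modest tree produces a lot of series.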

~~~
bsg75
So InfluxDB v1.0 has issues when the cardinality of the "primary key" (or
candidate keys) gets high?

At what level of keys or tags did you start to see query performance become
problematic?

------
tychuz
Just looking at the domain, it's easy to guess which one will win...

------
klucar
Has anyone successfully compiled their benchmark code?
[https://github.com/influxdata/influxdb-comparisons](https://github.com/influxdata/influxdb-comparisons)

I added code to the data generator to work with Timely
([https://nationalsecurityagency.github.io/timely/](https://nationalsecurityagency.github.io/timely/))
but can't get it compiled.

Also, it seemed that ingest and query were separate stages. Queries should be
run while ingest is running to get real-world performance, but I understand it
is more difficult to test this way.

------
dz0ny
It would be interesting to compare memory requirements. I chose InfluxDB
because it had 10 times lower memory usage. The dataset was small (a couple of
million datapoints)... but still.

~~~
dx034
That only works when you have one series with a lot of observations. If you
have many series with fewer observations (say 50k per series), InfluxDB uses
absurd amounts of memory. I had to switch back to Cassandra because I
constantly ran out of memory.

~~~
pauldix
We're working on solving the high cardinality problem. Hopefully soon

------
LogicX
Not sure why this blog post from July made it to the front page now.

Though 1.0 GA is being released today.

