
InfluxDB Clustering Beta and Data Explorer - pauldix
https://influxdata.com/blog/influxenterprise-beta-clustering-monitoring-influxdb/
======
perlin
We evaluated InfluxDB for our TSDB solution and found it to perform well in
general (up to about 100k unique series). Kapacitor also had some promising
features for real-time aggregations, but it was ultimately pretty buggy and
didn't support out-of-order event processing. All in all, it
looks like a great suite of products and we're curious to see how it evolves
over the next few years.

Like many other users here, we were disappointed about paid clustering, but
when the original press release said $400, we were willing to wait it out and
see. However, we ultimately decided to go a different direction after seeing
they wanted $20k+ to run clustering on a 256GB node. We ingest 10s of millions
of data points per day from IoT sensors, and expect our data size to far
exceed that capacity.

Instead, we plan to run our own Cassandra cluster with KairosDB
([http://kairosdb.github.io/](http://kairosdb.github.io/)) acting as a read /
write abstraction layer. It'll cost us about $11k to run the cluster for the
year, with 3 nodes @ 300GB each, leveraging Cassandra's free and open-source
clustering, HA, and replication technology.
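For anyone unfamiliar with the read/write abstraction mentioned above: KairosDB exposes a REST write path that takes a JSON list of metrics, which it stores in Cassandra for you. A minimal sketch in Python of building that payload (the metric and tag names here are made up for illustration):

```python
import json

# Build the JSON body KairosDB's POST /api/v1/datapoints endpoint expects:
# a list of metrics, each with [timestamp_ms, value] datapoints and tags.
# Metric and tag names below are illustrative, not from the thread.
def build_payload(metric, tags, points):
    """points is a list of (timestamp_ms, value) pairs."""
    return [{
        "name": metric,
        "tags": tags,
        "datapoints": [[ts, val] for ts, val in points],
    }]

payload = build_payload(
    "sensor.temperature",
    {"device_id": "abc123", "site": "plant-1"},
    [(1465839830000, 21.5), (1465839890000, 21.7)],
)
body = json.dumps(payload)
# To actually write, POST body to http://<kairos-host>:8080/api/v1/datapoints
```

KairosDB then handles the Cassandra schema and replication underneath, which is what makes it usable as a thin abstraction layer.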

~~~
CoffeeDregs
+1 for Cassandra and KairosDB. We use Instaclustr to host Cassandra on AWS and
do time series data management with Kairos. It's been a hugely scalable,
robust system; we use it to ingest about 5B datapoints per day.

Also, Brian (project leader of KairosDB) agreed to a small consulting project
for us to help us configure and tune Kairos. He's a very nice guy and was very
helpful.

------
linsomniac
I have just shut down 95% of my collectd/graphite infrastructure after
migrating it over to InfluxDB+telegraf+grafana. I'm loving it! Since shutting
down collectd, system load and I/O wait time across my fleet have gone way
down, and available CPU has gone way up.

Though I wouldn't say it was a smooth transition. I started with 0.8, IIRC,
and while it worked OK, it used an amazing amount of storage: 4GB for a year's
worth of graphite data blew up to 100GB for a month of InfluxDB.

I gave up on InfluxDB a few times during the process, but at 0.11 I tried it
again and it has been pretty good. We are only putting the Telegraf data and
one small service statistic in it, but the storage is pretty reasonable at
12GB for a few months of data. Querying and graphing the data with Grafana is
great.
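For anyone curious what the ingest side of this stack looks like: Telegraf writes to InfluxDB using its plain-text line protocol (measurement, tags, fields, timestamp). A minimal sketch of building one line in Python (measurement and tag names are illustrative; this skips the escaping rules for spaces and commas in values):

```python
# Build an InfluxDB line-protocol string:
#   measurement,tag=val field=val timestamp_ns
# Names below are illustrative, not from the thread; values with spaces or
# commas would need escaping, which this sketch omits.
def to_line(measurement, tags, fields, timestamp_ns):
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_part} {field_part} {timestamp_ns}"

line = to_line("cpu", {"host": "web01"}, {"usage_idle": 98.2}, 1465839830100400200)
# line == "cpu,host=web01 usage_idle=98.2 1465839830100400200"
```

Telegraf batches lines like this and POSTs them to the database's /write endpoint, which is part of why swapping collectd for Telegraf is so low-friction.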

If you have tried it before 0.11, definitely try it again. The guys giving a
Prometheus talk at PyCon were really down on InfluxDB, but they hadn't tried
it for 6 months. I was like "Yeah, it was unusable then". I wanted to like
Prometheus, but I just couldn't figure out how to feed my data into it.

~~~
el_isma
Have you tried 0.13? How much disk space is it using? It's supposed to be
"better than graphite".

Really curious, as I'm looking forward to setting up InfluxDB (moving from
graphite too). I was going to use collectd, but your comment makes me wonder
if it's the right choice.

~~~
linsomniac
Yeah, I'm currently running 0.13.0.

It's really hard to compare and say better or worse than graphite, because it
isn't an apples-to-apples comparison. I haven't yet figured out how to do
rollups like graphite has built in, so I think I'm holding onto the high-
precision data. That makes it not even close to a fair comparison. Still,
though, 12GB for our current data set feels reasonable. Back in 0.8, when it
was more like 150GB for a month, that was not gonna work.
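For what it's worth, the graphite-style rollup can be approximated in InfluxDB with a continuous query that downsamples raw points into a longer retention policy. A hedged sketch of the InfluxQL, built as a Python string (database, retention policy, and measurement names are made up; check the syntax against your InfluxDB version before relying on it):

```python
# InfluxQL for a graphite-style rollup: store 5-minute means of a raw
# measurement into a longer-lived retention policy. All names here are
# illustrative assumptions, not from the thread.
db = "telegraf"
cq = (
    f'CREATE CONTINUOUS QUERY "cq_cpu_5m" ON "{db}" BEGIN '
    f'SELECT mean("usage_idle") INTO "{db}"."longterm"."cpu_5m" '
    f'FROM "cpu" GROUP BY time(5m), * END'
)
# Run the statement via the influx CLI or the /query HTTP endpoint; the raw
# series can then live under a short retention policy while the 5m rollup
# under "longterm" is kept much longer.
```

The raw-precision data then ages out on its own, which is roughly what graphite's built-in retention schema does.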

For a while I was feeding collectd into InfluxDB. It didn't really produce
anything I could use immediately, and I haven't gone back to revisit it.

------
zenlot
Great news from influxdata. Though all of these features (clustering,
monitoring, and data exploration) could have been released with the standard
edition. The start of InfluxDB looked very promising; I'm not sure people feel
the same about the chosen business model.

~~~
maerF0x0
What other method would you suggest for them to become a sustainable business?

~~~
pandemicsyn
Prof services, training, dev support, production/tech/enterprise support,
hosted services, managed on prem, fancy management interface (ala Datastax)...

~~~
akbar501
There is a lot of discussion going on currently around OSS business models.
The general consensus is that support contracts and professional services do
not monetize well enough to grow a large business.

Hosted services are a viable business model. The risk here is that a
successful product can simply be deployed by Amazon. Dual licensing has been
one response to the Amazon risk.

Admin tools alone are also thought to be difficult to monetize at sufficient
value to grow a large business.

While DataStax has a nice UI, they have built a data platform with Cassandra
at the core, much like Cloudera has built a data platform with HDFS/HBase at
the core. Personally, I view the data platforms as prepackaged and productized
SI (which is very valuable).

------
mattbillenstein
I really like the influxdb interface -- reading and writing data are super
simple, but in some recent testing I did, it really really hated high-
cardinality data sets.

[https://docs.influxdata.com/influxdb/v0.10/guides/hardware_s...](https://docs.influxdata.com/influxdb/v0.10/guides/hardware_sizing/#when-do-i-need-more-ram)

Show stopper if you're in the same boat that I am.
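For context on why high cardinality hits the RAM tiers in that sizing guide: InfluxDB keeps an in-memory index entry per unique series, and the series count is roughly the product of the distinct values of each tag key. A quick back-of-the-envelope in Python (the tag counts are made up):

```python
# Series cardinality grows multiplicatively: every unique combination of
# tag key/values on a measurement is its own series in the index.
# The distinct-value counts below are illustrative assumptions.
from math import prod  # Python 3.8+

distinct_values = {"host": 500, "region": 10, "user_id": 100_000}
series = prod(distinct_values.values())
print(series)  # 500_000_000 unique series
```

A single high-cardinality tag like a user or device ID is usually the culprit, since it multiplies everything else.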

~~~
pauldix
After 1.0 is released, our two highest priorities are handling high-
cardinality data sets (an on-disk and in-memory inverted index) and making
continuous queries more automatic (downsample everything and give users the
option of query results that scale automatically, like Graphite).

~~~
mattbillenstein
Right on -- for the record, this data is ending up in mongodb at the moment,
so I need to write logic for binning by time and so forth myself. Our use case
is maybe odd in this space -- sampling a handful of values per item a couple
times a day, but for millions of items.

I compared a dataset to postgres, and the new mongo storage engine is very
good -- dunno if WiredTiger is something you guys are looking at, but it's
very compact and fast to query with the proper indexes.

Dumped some stats into a gist as well:
[https://gist.github.com/mattbillenstein/89969980025414e2bca8...](https://gist.github.com/mattbillenstein/89969980025414e2bca8325cf503342f)

~~~
oky
I had a similar issue with mongo and time series data. For special-purpose
event data like that gist, I think you should consider using a column store.

~~~
mattbillenstein
I may eventually -- we're using mongo for the rest of the metadata for now, so
this is good enough to not need something else at the moment.

