
Elasticsearch as a Time Series Data Store - trampi
https://www.elastic.co/blog/elasticsearch-as-a-time-series-data-store
======
discordianfish
That's great for unstructured data, like data with high cardinality in the
dimensions. But for most real-world metrics outside analytics this isn't
necessary, and a data model like Prometheus's makes more sense. If I did the
math right, even after compression Elasticsearch uses 22 bytes per data point
(23M points / 508 megabytes), where Prometheus uses about 2.5-3.5 bytes per
data point.

Disclosure: Prometheus contributor here
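
As a sanity check, the 22-byte figure follows directly from the numbers quoted above (assuming decimal megabytes; nothing here is re-measured):

```python
points = 23_000_000      # data points, as quoted
es_bytes = 508 * 10**6   # 508 MB on disk in Elasticsearch

es_per_point = es_bytes / points
print(round(es_per_point, 1))  # ~22 bytes per point

# Versus Prometheus's quoted 2.5-3.5 bytes per point,
# that's roughly a 6x-9x difference.
print(round(es_per_point / 3.5, 1), round(es_per_point / 2.5, 1))
```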

~~~
existencebox
Given your background, I'm going to take this opportunity to ask some very
noobish questions. (I will be doing my due diligence to read up on Prometheus,
I likely should have done so prior, but these are just some off the top of my
head that I'm not sure I'd gain a good intuition for until late in the
learning process)

\- How is the bulk of this additional compression derived? Is it explicitly
the existence of a data model that lets you use more aggressive/intelligent
compression strategies?

\- Does this come at a cost? (increased CPU overhead, latency at read time,
something like that.)

~~~
chaotic-good
If you can predict the next value in the sequence you can achieve better
compression (XOR the predicted value with the actual one, rather than the
previous with the next -
[http://users.ices.utexas.edu/~burtscher/papers/dcc06.pdf](http://users.ices.utexas.edu/~burtscher/papers/dcc06.pdf)),
but I don't think the Prometheus developers are using this.

And BTW, I don't believe 2.5-3.5 bytes per data point is achievable outside
synthetic benchmarks.
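
For the curious, the XOR trick in schemes like the linked FPC paper (and Facebook's Gorilla) works because the bit patterns of nearby doubles differ in only a few bits. A minimal sketch of producing the residuals, with the predictor and entropy-coding stages omitted:

```python
import struct

def xor_residuals(values):
    """XOR each double's 64-bit pattern with its predecessor's.

    When consecutive samples are close, the residual has long runs
    of zero bits, which the entropy coder can store very compactly.
    (Real implementations predict the next value; this sketch just
    uses the previous one as the "prediction".)
    """
    prev = 0
    out = []
    for v in values:
        bits = struct.unpack("<Q", struct.pack("<d", v))[0]
        out.append(bits ^ prev)
        prev = bits
    return out

# A slowly changing gauge: residuals for repeated values are zero,
# and small changes touch far fewer bits than the raw 64-bit pattern.
samples = [100.0, 100.0, 100.5, 100.5]
res = xor_residuals(samples)
print([r.bit_length() for r in res])
```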

------
kiyoto
While I'm happy to hear about a great success story of a great piece of open
source software, Elasticsearch has done a great disservice by making
application developers lazy about learning the ins and outs of various
analytical/transactional/storage backend systems.

Echoing other commenters, Elasticsearch is hardly the best tool for many kinds
of analytics. In fact, it is strictly not a good tool for several use cases.
For starters:

1\. It's not good at joining two or more data sources

2\. It's not good at complex analytical processing like window functions (for
example, calculating session length based on the deltas of consecutive
timestamps, partitioned by user_id and ordered by time).
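
To make item 2 concrete, here is a hypothetical sketch (plain Python, made-up event data) of the computation described, which in SQL would typically be written with something like `LAG(ts) OVER (PARTITION BY user_id ORDER BY ts)`:

```python
from itertools import groupby
from operator import itemgetter

# Made-up (user_id, timestamp-in-seconds) events.
events = [
    ("alice", 0), ("alice", 30), ("alice", 70),
    ("bob", 10), ("bob", 25),
]

def session_lengths(events):
    """Per-user session length: sum of deltas between consecutive
    timestamps, i.e. last timestamp minus first, per user."""
    lengths = {}
    # "partition by user_id, order by ts"
    for user, group in groupby(sorted(events), key=itemgetter(0)):
        ts = [t for _, t in group]
        lengths[user] = ts[-1] - ts[0]
    return lengths

print(session_lengths(events))  # {'alice': 70, 'bob': 15}
```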

Of course, it's also good at many things, like simple filtering and aggregation
against "real-time" data. Being in-memory really helps with performance, and
with the right tools it's horizontally scalable. Elastic's commercial support
is also not to be discounted.

However, as an old OLAP fart who spent years optimizing KDB+ queries, I am
deeply concerned about the willful ignorance of data processing systems that I
see among Elasticsearch fans. Just take my word for it and study Postgres
(with the cstore_fdw extension) and other _real_ databases, in-memory or
otherwise, open-source or proprietary, so that you won't be shooting yourself
(or future co-workers) in the foot trying to shoehorn Elasticsearch and its
ilk into suboptimal workloads. (To be fair, I see a similar tendency among
Splunk zealots.)

~~~
burntsushi
> Of course, it's also good at many things like simple filtering and
> aggregation against "real-time" data.

And also fulltext search at scale, which is basically its primary use case.

PostgreSQL's fulltext search isn't quite at the same level. The last time I
looked into its capabilities, it didn't fully support TF-IDF. (I don't think
it keeps track of corpus frequencies for terms.) Interestingly, I think
SQLite's fulltext support does include TF-IDF, but I could be misremembering.
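
For illustration, a toy TF-IDF computation (made-up data) showing why the engine has to track per-term document frequencies across the whole corpus, which is the statistic the comment suggests Postgres's fulltext search doesn't maintain:

```python
import math
from collections import Counter

docs = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog barks",
]

def tf_idf(term, doc, corpus):
    # Term frequency: how often the term appears in this document.
    words = doc.split()
    tf = Counter(words)[term] / len(words)
    # Document frequency: how many documents in the corpus contain
    # the term. This is the corpus-wide statistic IDF depends on.
    df = sum(1 for d in corpus if term in d.split())
    idf = math.log(len(corpus) / df)
    return tf * idf

# "the" occurs in every document, so its IDF (and score) is zero;
# a rare term like "fox" scores higher.
print(tf_idf("the", docs[0], docs))
print(tf_idf("fox", docs[0], docs))
```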

I mean, the Elasticsearch docs are pretty clear that joining doesn't work well
(or really, at all). I'm not sure how being clear about the trade-offs of your
software is "doing a disservice." Sometimes you don't need to store relational
data. Sometimes you do need to store relational data, but the other benefits
of Elasticsearch outweigh shoehorning relational data into what is
effectively a document database.

If your only complaint is that people misuse software... Well... Yeah. It's
been happening for a while now. We should help educate others. I'm not sure
your approach is the most constructive.

------
yeukhon
Worth mentioning that Elastic.co has a commercial product called Watcher [1],
which I think is a really nice way to build an automated alerting system. The
downside is that, being a commercial product, I can't use Watcher and would
have to implement one myself.

I am still deciding between ES, a relational database, and Cassandra for time
series data. We use Graphite now and are happy with it, but I think having a
single database handling logs, events, and metrics data would be much better.
Having logs already in ES does make ES a stronger choice.

[1]:
[https://www.elastic.co/guide/en/watcher/current/index.html](https://www.elastic.co/guide/en/watcher/current/index.html)

~~~
chrstphrhrt
We used elastalert on a project and it did the trick:
[https://github.com/Yelp/elastalert](https://github.com/Yelp/elastalert)

~~~
yeukhon
Thanks... and it's Python, so I don't have to reinvent anything... will take a
look, thanks!

------
sciurus
The 2.5 release of the time series focused dashboard Grafana added support for
Elasticsearch. In a way they've come full-circle, since Grafana started
several years ago as a fork of the Elasticsearch dashboard Kibana.

[http://grafana.org/blog/2015/10/28/Grafana-2-5-Released.html](http://grafana.org/blog/2015/10/28/Grafana-2-5-Released.html)

------
badlogic
So many ways you can abuse Lucene :) Many years ago, we used it as graph data
storage as well.

~~~
boomzilla
And I used Lucene as the data backend for an inference engine recently :) If
there is one library that keeps me in the Java world, it's Lucene. No other
open-source project comes close to Lucene in its category.

------
mbajkowski
We use Elasticsearch in a very similar manner to that described in the article
to store high-frequency data for our instance and multi-cloud profiling /
benchmarking tool:

[https://profiler.bitfusionlabs.com](https://profiler.bitfusionlabs.com)

Since we are collecting data at sub-second granularity and did not want to
introduce noise on the profiled instances themselves, whether for CPU, memory,
or disk, we had to play a few tricks with how we collect the data and
precisely when we send it to Elasticsearch, but in general it has been working
out very well for us.

------
pnachbaur
I tend to think of Time Series data as being several orders of magnitude
larger than 23 million data points per week (38 per second) but now I can't
seem to find a good definition of Time Series data. Anyone have thoughts on
the rough threshold between event data and time series data? I think of arrays
of hundreds/thousands of individual sensors that take 10 measurements a second
as "different" than user generated data that is time-ordered.

~~~
hcrisp
I agree, time series should be more like 1000 measurements taken 100 times a
second. Industrial acquisition data is not the same thing as timestamped web
log data.

------
lafar6502
??? Elasticsearch good for everything... it looks like computers got cheap and
fast enough to do almost anything. Why not put the data in a SQL database? I
suspect that would be much better.

But nothing seems strange anymore when, in order to monitor one server, you
have to run a ten-machine Elasticsearch cluster as a log collector.

