
A Prometheus fork for cloud scale anomaly detection across metrics and logs - talonx
https://www.zebrium.com/blog/a-prometheus-fork-for-efficient-cloud-scale-autonomous-monitoring
======
mk_
Did you raise the challenges you had with Prometheus to the dev team and
community? They are usually quite responsive and open for contributions. I'd
be interested in the discussion around the reasons you brought up for forking.

~~~
chaps
Not OP, but one of the biggest issues I've had with Prometheus is that it
doesn't support backfilling of time-series data. Bringing it up with the
Prometheus devs led to oddly indignant responses that they had no intention of
supporting backfilling, because that's not what Prometheus was designed for.
For the large production systems Prometheus is meant for, it's a very
disappointing missing feature, and I could absolutely see why someone would
want to fork, if nothing else just to get away from that level of indignation.

~~~
hnarn
I don't know if the Prometheus devs care about it being enterprise-level
monitoring software, but not having any type of backfilling on the roadmap
pretty much disqualifies it. Accurate reporting is not a nice-to-have for
enterprise-grade monitoring.

~~~
SuperQue
Backfilling data is on the roadmap. It's already possible; the tooling is
being worked on.

Also, accurate reporting and backfill are not related.
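
For reference, the backfill tooling that eventually shipped works roughly like
this (command names are from recent promtool, not necessarily what was
available when this thread was written; a sketch, not the official procedure):

```
# metrics.om: an OpenMetrics text file with historical samples
# (timestamps in seconds), ending with the mandatory "# EOF" marker.
cat > metrics.om <<'EOF'
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{job="api"} 1027 1570000000
http_requests_total{job="api"} 1062 1570000060
# EOF
EOF

# Convert the file into TSDB blocks under ./out, then move them
# into the Prometheus data directory for the server to pick up.
promtool tsdb create-blocks-from openmetrics metrics.om ./out
```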

~~~
chaps
It's been on the roadmap for _at least two and a half years_, so that's not
very convincing:
[https://web.archive.org/web/20170708101602/https://prometheu...](https://web.archive.org/web/20170708101602/https://prometheus.io/docs/introduction/roadmap/)

Curious to know why you think accurate reporting and backfilling aren't
related. In my experience, they absolutely are -- mostly during disaster
scenarios, when that information is more critical than at _any_ other time.

------
manigandham
Yet another Prometheus/time-series backend project.

And yet again, it would be far better to just export the data from Prometheus
into a distributed columnstore data warehouse like ClickHouse, MemSQL, or
Vertica (or other alternatives). This gives you fast SQL analysis across
massive datasets, real-time updates regardless of ordering, and unlimited
metadata, cardinality, and overall flexibility.

Prometheus is good at scraping and metrics collection, but it's a terrible
storage system.
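
A minimal sketch of what that could look like in ClickHouse (the table layout,
column names, and example query are assumptions for illustration, not anything
the commenter specified):

```sql
-- Hypothetical table for Prometheus samples exported via remote write.
-- MergeTree accepts out-of-order inserts and scans time ranges quickly;
-- the Map column carries the full label set with no cardinality cap.
CREATE TABLE metrics
(
    name   LowCardinality(String),   -- metric name
    labels Map(String, String),      -- full label set
    ts     DateTime64(3),            -- sample timestamp (ms precision)
    value  Float64
)
ENGINE = MergeTree
ORDER BY (name, ts);

-- Example analysis: 5-minute averages per metric name.
SELECT name, toStartOfFiveMinutes(ts) AS bucket, avg(value) AS avg_value
FROM metrics
GROUP BY name, bucket
ORDER BY bucket;
```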

~~~
SuperQue
Yup, and that's kinda intentional. The design is to be a monitoring system,
not a generic TSDB. What it does needs to be simple, fast, and reliable so
that your alerts get sent out.

The original design inspiration, borgmon, also was a terrible storage system
and had an external long-term store layered on top of it.

This isn't a design flaw; it's an intentional trade-off to make the core use
case as bulletproof as it can be. Having seen "monitoring systems" built on
something like Cassandra, i.e. distributed storage, I find the idea
cringe-inducing. The first thing to crash at the first sign of network trouble
is distributed storage.

~~~
manigandham
My point is that monitoring != storage and there are plenty of great storage
systems to use so there's not much reason to create another one. For some
reason developers love to create home-made (time-series) databases.

------
DeltaSigma
Alternative title: "We layered compression on top of the Prometheus remote
storage adapter"

~~~
nak923
Zebrium employee here: we could not use the Prometheus remote storage adapter
because of the following issues:

1. By the time data gets there, a lot of information has been lost (metric
type, help text, out-of-order drops, etc.).

2. You do not have control over when it gets sent out, since it has to come
through the TSDB after chunking, which adds a lot of latency.

3. It is a per-time-series protocol, which is not good for sending all the
samples of a scraped target as a single chunk. Sending them as a single chunk
lets you group similar metrics on the remote side and simplifies the protocol
for reducing network bandwidth.

Hence we did not use the remote storage adapter. We introduced a new interface
that plugs into the scraper directly.

------
conjectures
Anyone know what statistical models Prometheus uses?

Having scanned this article, read the wikipedia entry and hit the landing
page, I am none the wiser.

~~~
DeltaSigma
There isn't much in terms of modelling built into the tool itself, because
that kind of functionality is secondary (the tool is primarily focused on
quick and efficient gathering of metrics). You can do some basic forecasting
at query time (Holt-Winters smoothing and linear regression), but beyond that
you're using one of the remote storage adapter plugins to send the metrics to
a remote time-series database and then applying your ML there (as OP's service
does).

[https://prometheus.io/docs/prometheus/latest/querying/functi...](https://prometheus.io/docs/prometheus/latest/querying/functions/#predict_linear)
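
For example, the two built-in query-time forecasting functions look like this
in PromQL (the metric name is a node_exporter metric used purely for
illustration):

```
# Predict each free-disk-space series 4 hours out, via linear
# regression over the last hour of samples.
predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600)

# Double-exponential (Holt-Winters) smoothing over the last hour,
# with smoothing factor 0.3 and trend factor 0.1.
holt_winters(node_filesystem_avail_bytes[1h], 0.3, 0.1)
```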

~~~
conjectures
Figures, a fine objective, but not what I'd imagined.

------
sagichmal
Anomaly detection is snake oil.

~~~
jcims
Datadog figured it out. It's pretty great. Obviously it doesn't see
everything, but it's a very useful signal. Their anomaly correlation is even
cooler.
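
For context on what the simplest form of statistical anomaly detection looks
like, here is a toy rolling z-score sketch (an illustration only; it is not
Datadog's or Zebrium's actual method, and the window/threshold values are
arbitrary):

```python
from statistics import mean, stdev

def anomalies(series, window=10, threshold=3.0):
    """Flag indices whose value deviates more than `threshold`
    standard deviations from the trailing window's mean."""
    flagged = []
    for i in range(window, len(series)):
        w = series[i - window:i]
        mu, sigma = mean(w), stdev(w)
        if sigma > 0 and abs(series[i] - mu) > threshold * sigma:
            flagged.append(i)
    return flagged

# A steady signal with one spike at index 15.
data = [10.0, 10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9, 10.0,
        10.1, 9.9, 10.0, 10.2, 9.8, 50.0, 10.0, 10.1, 9.9, 10.0]
print(anomalies(data))  # → [15]
```

The false-positive/false-negative trade-off commenters disagree about above
lives entirely in the `window` and `threshold` choices; production systems
replace this with far more elaborate models.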

