
Systems Monitoring with Prometheus and Grafana - jsulak
https://flightaware.engineering/systems-monitoring-with-prometheus-grafana/
======
dijit
Grafana truly is best in class, but I have strong reservations about
Prometheus.

I really want to like it; it’s just so _easy_: publish a little webpage with
your metrics and Prometheus takes care of the rest. Lovely.
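
To illustrate just how little is needed, here is a minimal sketch of such a "little webpage" using only the Python standard library (the metric name and port are made up; real exporters usually use a client library instead):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def render_metrics(request_count):
    # Prometheus text exposition format: HELP/TYPE comments, then "name value".
    return (
        "# HELP demo_requests_total Total requests served.\n"
        "# TYPE demo_requests_total counter\n"
        f"demo_requests_total {request_count}\n"
    )

class MetricsHandler(BaseHTTPRequestHandler):
    requests_served = 0

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        MetricsHandler.requests_served += 1
        body = render_metrics(MetricsHandler.requests_served).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

# To serve: HTTPServer(("", 8000), MetricsHandler).serve_forever()
# then point a Prometheus scrape job at http://localhost:8000/metrics
```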

But I often find that the resolution of the data is substantially lower than
even the defaults of alternatives (InfluxDB defaults to 1s and even Zabbix to
5s).

Not to mention the lost writes (missing data points) which have no logged
explanation.

All of this, however, was in my homelab, which, while unconstrained in
resources, lacks a lot of the fit and finish of a prod system.

The architecture also gives me pause; it’s not meant to scale. That’s written
on the tin, so it’s not like I’m picking fault, but when you’re building a
dashboard that pulls in data from 25 different Prometheus data sources, it
becomes difficult to run functions like SUM(), because the keys may be out of
sync, causing some really ugly and inaccurate representations of the data.

Everything about the design (polling, single database) tells me that it was
designed primarily to sit alongside something small. It could never handle the
tens of millions of data points per second that I ingest(ed) at my (now
previous) job.

But it has a lot of hype, and maybe I’m holding it wrong.

~~~
skohan
> Grafana truly is best in class

Really? We’ve recently been playing with Chronograf and InfluxDB, and most
people find the pair a lot nicer to work with than Grafana (specifically
because discoverability is a _lot_ better).

~~~
joseluisq
For our modest cloud infra, the InfluxData TICK stack (InfluxDB, Kapacitor,
Chronograf and Telegraf) has fit our needs exactly. We really like its
composable building blocks, interoperability, and, yes, easy discoverability
and configuration. InfluxQL is also very convenient, letting us customize
reports on InfluxDB with ease.

------
kasey_junk
I have a love/hate relationship with Prometheus. If I had no budget for
metrics it’s likely the thing I would reach for, but I’m dying for someone to
open source a ‘next level’ metrics system (something like Monarch or Circonus,
but free).

But woe betide the team that has to run it as a service. Not that other
metrics systems are better, but Prometheus can be brutal in that space.

As a ‘squad level’ tool it’s really good. After that it gets hairy fast.

~~~
valyala
Could you give more context on Monarch or Circonus features that are missing
in Prometheus?

BTW, I'm working on VictoriaMetrics, an open source monitoring solution that
works out of the box. See
[https://github.com/VictoriaMetrics/VictoriaMetrics](https://github.com/VictoriaMetrics/VictoriaMetrics)

~~~
kasey_junk
1) Histograms as the basic primitive. 2) Bidirectional transport. 3)
Runtime-configurable filtering at source, collection, and sink. 4) Provenance
as part of the transport.

------
Jedd
We've got a somewhat similar landscape on a pretty sizeable network: a big
investment in Zabbix, and we're looking to move, perhaps slowly and perhaps
only in part, towards Prometheus.

Coming from a monitoring system that supports push and pull with elegant auto-
discovery, we're struggling to work out a sane architecture around
(effectively pull-only) Prometheus.

~~~
EdwardDiego
There's a push gateway:
[https://github.com/prometheus/pushgateway](https://github.com/prometheus/pushgateway)
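
Pushing to the gateway is just an HTTP PUT of exposition-format text, with the grouping labels (job, instance) in the URL path. A minimal sketch (the gateway host and metric name are made up):

```python
from urllib.request import Request, urlopen  # urlopen needed only when actually pushing

def build_push_request(gateway, job, instance, metrics):
    """Build a Pushgateway PUT; `metrics` is a dict of name -> value."""
    # Grouping labels live in the URL path, not in the body.
    url = f"{gateway}/metrics/job/{job}/instance/{instance}"
    body = "".join(f"{name} {value}\n" for name, value in metrics.items())
    return Request(url, data=body.encode(), method="PUT")

req = build_push_request(
    "http://pushgateway.example:9091",  # hypothetical gateway host
    job="laptop",
    instance="thinkpad",
    metrics={"battery_percent": 87},
)
# urlopen(req)  # uncomment once a real gateway is reachable
```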

~~~
Jedd
Yeah, I think we've looked at that. It provides push for the last mile, and I
suppose you could wrangle some auto-discovery using that tooling, but you're
still doing pull from Prometheus to that/those server(s).

We're still a bit stuck trying to replicate all the make-life-easy
functionality we get with Zabbix sitting on a honking great PostgreSQL /
Timescale database, with a bunch of proxies, and automated agent installs that
auto-register.

There are places where that _doesn't_ work well (k8s, f.e.), but for
conventional fleet metrics it's difficult to abandon.

~~~
EdwardDiego
Yeah, true. We find it easy because we're using K8s annotations for Prometheus
scrape-target discovery, so the gateway is just another target, and we're not
running so many ephemeral jobs that we'd need more than one gateway.
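
For reference, the annotation-driven discovery described here usually comes down to a `kubernetes_sd_configs` job plus a `keep` relabel rule; a minimal sketch (the job name is arbitrary, and it assumes pods carry the conventional `prometheus.io/scrape` annotation):

```yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```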

------
ablekh
Interesting post. However, I believe that most content, especially broad
technical content like this, absolutely needs a balanced amount of relevant
visual elements (e.g., images, diagrams). If you want it to be readable, that
is.

~~~
icelancer
Yeah, odd post that talks a lot about Grafana and visualizations then uses
absolutely none.

~~~
ablekh
Yes, it's quite ironic. Hopefully the authors will eventually improve the
post, because the core content is valuable.

------
djmetzle
I thought this article, while a little dry, was very illuminating. It sounds
like Hyperfeed is running at the very least "Medium Data" (we all think our
data is Big!). And I think it is fascinating to hear of a case where
Prometheus is plainly a bad fit for its intended purpose. It sounds like the
cardinality explosion around their ML models was a really bad match for
Prometheus. It's great to hear about deployments "in situ", with people
appreciating where it works well and where it doesn't.

------
bacondude3
What's a good alternative to Prometheus when pulling stats is impractical? Say
I want to monitor a personal laptop like I would a server. It will change
networks and IP addresses, so pulling would be impractical to configure,
whereas the laptop could easily(?) push its stats to a remote server.

~~~
molecule
Telegraf + InfluxDB?
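
For the laptop case, a minimal Telegraf config sketch might look like this (the InfluxDB URL is a placeholder):

```toml
# telegraf.conf: collect locally, push to a remote InfluxDB over HTTP
[[inputs.cpu]]
[[inputs.mem]]

[[outputs.influxdb]]
  urls = ["https://influx.example.com:8086"]  # hypothetical remote endpoint
  database = "telegraf"
```

Because Telegraf pushes on its own interval, the server never needs to know the laptop's current IP address or network.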

~~~
abhishekjha
This is my setup on all my Raspberry Pis. I have not been able to figure out
how to monitor a cluster, though. I saw that the Grafana free tier doesn’t
allow a cluster of servers to be monitored. I have telegraf + influxdb +
grafana installed on all my servers.

~~~
michaelmcdonald
Could you expand on what you mean by:

> grafana free tier doesn’t allow a cluster of servers getting monitored.

Is there a particular aspect of the cluster you're missing? Is it that you
don't want individual server metrics?

~~~
abhishekjha
I have telegraf + influxdb + grafana-server installed on each of my Rpis
giving me multiple dashboards. I want only one grafana-server dashboard where
all the telegraf metrics could be seen.

------
site-packages1
What do you all do with the collected metrics over time? Do you store
everything forever, drop everything after a couple of weeks, or something in
between? I've heard of people thinning out old data and storing it long term
rather than storing everything in full. What's the usual approach?

~~~
sagichmal
High-fidelity operational metrics have a useful half-life measured in days or
weeks. Read patterns for longer-term use cases are also categorically
different. The best architecture is a separate system for the long-term data,
which treats Prometheus as a data source; then Prometheus can drop data after
14-28d.
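
On the Prometheus side, that cutoff is just the retention flag; a sketch (the 21d value is arbitrary):

```shell
# Keep local TSDB data for ~3 weeks; let the long-term system own the rest
prometheus --config.file=prometheus.yml \
  --storage.tsdb.retention.time=21d
```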

~~~
jldugger
> High fidelity operational metrics have a useful half life measured in days
> or weeks.

Depends on the metric, IMO. There's a ton of use you can get out of
forecasting and seasonality for anomaly detection, but you need data going
back for that to have any chance. Many relevant operations metrics exhibit
three levels of seasonality: daily (day/night), weekly (weekday/weekend), and
annual (holidays, Super Bowls, media events). Being able to forecast inbound
network traffic on a switch to find problems would effectively require you to
have 1y of data. You _might_ be able to discard some of it, but you'd lose
some of the predictive capacity for, say, the Super Bowl.
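
One middle ground is to downsample rather than discard: keep, say, hourly means long term instead of per-minute points, which preserves the daily/weekly/annual shape at a fraction of the storage. A toy sketch in pure Python with made-up data:

```python
def downsample(points, bucket):
    """Average consecutive `bucket`-sized groups of samples."""
    return [
        sum(points[i:i + bucket]) / len(points[i:i + bucket])
        for i in range(0, len(points), bucket)
    ]

# One day of per-minute samples with a crude day/night pattern:
# busy (10) from 08:00 to 20:00, quiet (2) overnight.
minute = [10 if 480 <= m < 1200 else 2 for m in range(1440)]

hourly = downsample(minute, 60)  # 1440 points -> 24, daily shape survives
```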

~~~
sagichmal
I agree that it's important to keep some telemetry data for the long term. My
point is that you need fewer and less granular metrics for those use cases,
and that the access patterns are sufficiently different from real-time
operations, that they're most effectively served by two completely different
systems.

------
cmckn
Prometheus is great. I first heard about it at KubeCon last fall and kind of
shrugged it off as one of those fledgling "cloud native" projects that I
probably didn't need or didn't have time to learn. But there's actually a lot
of adoption: you can find great exporters and Grafana dashboards for almost
any OSS you're running today. I started collecting metrics from Zookeeper and
HBase in about an hour, having never had access to that telemetry before. From
the existence of Cortex[1], it seems that Prometheus doesn't scale incredibly
well, but I don't think many users will hit those limits.

[1] [https://cortexmetrics.io/](https://cortexmetrics.io/)

~~~
edoceo
My Prometheus system is a $10/mo Linode. It collects from 27 other hosts, and
at least 100 services distributed across those hosts - doesn't even break a
sweat. All the exporters run through a wireguard VPN. Prometheus is great for
a small/medium SaaS type environment.

~~~
abhishekjha
What do you use as a frontend? As far as I can tell, the Grafana free tier
doesn’t allow monitoring a cluster of servers.

~~~
dewey
You could self host it.

~~~
abhishekjha
Can I self-host for monitoring a cluster of servers? Currently I have Grafana
installed on each of my servers and am having to monitor them individually. I
want a centralised dashboard over telegraf + influxdb.

~~~
detaro
Why would you install Grafana + Influx on each server instead of one central
one?

~~~
abhishekjha
I haven't spent much time on this, but most of the docs were for setting it up
on each host. Is there a proper tutorial for clusters?

Also, I wanted to keep monitoring unaffected for the other servers if one of
them goes down. If I set up a central server for monitoring, then that becomes
a single point of failure.

~~~
heliodor
Grafana is meant to run as a single instance. For monitoring multiple servers,
you need to get the metrics into one data store, from which Grafana will read;
that's Prometheus's job. These pieces should not be on the same servers that
run your product. For HA, you can run two or more Prometheus instances as
duplicates, so you can switch to another one if the main one is down.

~~~
abhishekjha
Would the one data source have a single database with several tables, one for
each server? Let's say I am monitoring MySQL. Currently I have a `mysql` table
in a database named `telegraf` on each host. Can I combine multiple Influx
data sources into a single dashboard? That would be easier right now for my
current setup.

~~~
heliodor
You handle multiple servers by tagging your metrics with the server id or
name. No need to create a table per server.

Each panel in a dashboard points to a specific data source, so you can have
multiple data sources in one dashboard.
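
With Telegraf's default `host` tag, a single shared database can then serve every server from one query; a sketch in InfluxQL using the standard `cpu` measurement:

```sql
-- One series per server from a single shared database
SELECT mean("usage_idle")
FROM "cpu"
WHERE time > now() - 1h
GROUP BY time(1m), "host"
```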

------
halfmatthalfcat
Prometheus and Grafana are awesome; I use them personally for all my
monitoring.

However I’m still trying to nail down my high cardinality/highly unique
metrics-like data story. What are people using?

I’ve heard of a combination of Cassandra/BigTable and Spark as a potential
solution?

~~~
latchkey
I found this interesting. My plan is to move from Prom to Victoria.

[https://medium.com/@valyala/measuring-vertical-scalability-f...](https://medium.com/@valyala/measuring-vertical-scalability-for-time-series-databases-in-google-cloud-92550d78d8ae)

~~~
sagichmal
Woof, good luck. Not a great product.

~~~
PerusingAround
Care to elaborate? At least a slight mention of why.

------
apihealth
I've been looking into Prometheus + Grafana for other reasons. I have some
3rd-party APIs connected through an API gateway, which I need to health-check,
and I couldn't find other open source alternatives. I'm going to move the
whole setup to the cloud at some point, but I'm not sure if this is the right
thing to do. Does anyone have other articles / open source tools which could
be helpful? This article goes much deeper into how the setup can be used, but
I'm looking for simpler use cases of the same setup, for the task I need to
do.

~~~
valyala
Perhaps vmagent could be useful for your case? See
[https://victoriametrics.github.io/vmagent.html#use-cases](https://victoriametrics.github.io/vmagent.html#use-cases)

------
osn9363739
Does anyone have anything good or bad to share about using Grafana as a front
end for metrics logged in AWS CloudWatch? I know it has a plugin, and I'm fed
up with how bad the CloudWatch dashboards are, so I'm wondering if I should
check it out.

~~~
uaas
Well, I’d say give it a try. We are using CloudWatch as a Grafana data source,
because this way you can concentrate more of your monitoring in one place,
which is useful during troubleshooting. With Grafana 7.x, you can even check
and correlate your CloudWatch logs inside Grafana, deep-linked to the AWS
Console. As of that major version, you can also wire Jaeger into Grafana, so
you have a one-stop solution for tracing as well (and logging, if you utilize
Loki too).

