
Ask HN: DevOps, why do people still use Grafana/Prometheus etc.? - antocv
What is the point of "monitoring", setting up a fancy dashboard showing graphs of some time-series data?

I've seen this used only to impress management and to get a "Star Trek-y" look in the office. But nobody actually stands and looks at a graph as their day job, nor should they. Nor do alarms go out to people from the Grafana dashboard.

Here is the thing: if you can have alarms go out when something is wrong, why do you need to see that on a graph?

I really don't see the point of "monitoring solutions" when any actionable event (even if it is generated by interpreting time-series data with "machine learning") can just be an actionable event without showing stuff on a dashboard.

Enlighten me, devops monitoring folks, please?
======
core-questions
> But no body actually stands and looks at a graph as their day-job, nor
> should they.

That's very interesting. Where I work, looking at these graphs is part of our
day to day duties: we need to know how our systems are doing, and alerting
conditions have not yet been defined to cover every possible thing that could
go wrong. Typically, the graphs help us point out conditions we need to watch
for, which are often a combination of multiple things happening at the same
time which necessitate human action to resolve.

Graphs also help us on the business side because we can see what happens with
the utilization of our service after various marketing efforts, launches,
promotions, etc. And while there are BI tools to do this, they often suck in
many ways compared to Grafana, so it's usually a better bet to stick with a
place where the data can be viewed idiomatically.

Last but not least, without the ability to look at graphs, how do you know all
of your monitoring is working and configured well? It's not enough to just
have blind faith in the system; you need to check every so often to make sure
things are flowing well.

It's not about it being a fancy star trek dash, but damn if that doesn't
impress management anyway.

------
a-saleh
I have been on L2 and L3 on-call duty in my previous company, and some of the
times these dashboards were life-savers.

Is the queue growing? Is a certain node not processing all it should? Etc.

Alerts actually did go out from Prometheus.

Hindsight is 20/20, and I remember, for example, having to change an alert on
the median aggregate node memory spiking so that it alerted on any single node
spiking. Seems obvious in retrospect, but with only the alert and no graphs, I
don't know that we would have caught it.

And there were parts where the right alert wasn't obvious even in retrospect,
in the more complicated parts of our data pipeline, and there Grafana really shines.
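The fix described above can be sketched as a Prometheus alerting rule. The metrics are standard node_exporter gauges, but the threshold, durations, and labels here are invented for illustration:

```yaml
groups:
  - name: node-memory
    rules:
      # Alerting on avg() across instances would hide one node spiking;
      # leaving the "instance" label un-aggregated fires once per node.
      - alert: NodeMemoryLow
        expr: |
          node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.10
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Low available memory on {{ $labels.instance }}"
```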

------
ahpearce
As others have said, there may not be an 'event'. Some metrics need to be
monitored manually _before_ setting up an event to trigger. Sure, you might
have engineers analyzing the time series data, but you also need to keep your
systems up. There are multiple failure modes for various services that may
require different action.

For example, perhaps you have some poorly written legacy service that has a
memory leak. Let's just say for the sake of this argument, that any sort of
boolean indicators (e.g. checking if the process is running) will give you an
'Okay' or green. You are still probably interested in monitoring the memory
usage to make sure the service is operating correctly and performing well. After
monitoring for some time, maybe you determine your ops engineers are taking
some action whenever the memory gets _around_ 80% or something... then you can
set up the trigger event. But without that manual monitoring up front, you can't
just magically set that threshold.
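Once that ~80% figure has been observed empirically, it can be encoded as a rule. A rough sketch as a Prometheus alert, where the job name, memory limit, and durations are all hypothetical:

```yaml
- alert: LegacyServiceMemoryHigh
  # Threshold discovered by watching the graph: ops intervened around
  # 80% of the service's assumed 4 GiB limit, so warn at that point.
  expr: process_resident_memory_bytes{job="legacy-service"} > 0.8 * 4 * 1024 * 1024 * 1024
  for: 10m
  labels:
    severity: warning
```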

------
PaulHoule
It depends what you are debugging.

I have a smart home project at home and I learn frequently that this is
something that normal people will fail at.

I have a Sengled switch that connects to a SmartThings hub, calls a Lambda
function, and posts a message to an SQS queue which my home server drains and
pushes into RabbitMQ.

I found that if I didn't use the switch for a while (say hours) I would push
it and wait 20 seconds or more for it to turn on (maddening because you might
not have faith that it will change which will make you push the button and
send more events...) It was reliable, but slow.

I got timestamps from as many parts of the system as I could, made graphs,
and that led me to add a heartbeat that kept the lambda and queue active and
also to switch from a FIFO queue to an ordinary queue. Between those two steps
the time from SmartThings to activation is in the 200-300 ms range, and with
the light configured to turn on instantly instead of fading, it feels responsive.

Note though I was not using Grafana or a tool like that, rather I was working
w/ Jupyter and Pandas. After the system has run for a few weeks I might be
able to do a detailed analysis of the tail latency, but it's not a "dashboard"
I run over and over again unless the problem recurs.
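That kind of one-off notebook analysis might look something like this in pandas. The timings and column names below are made up for illustration; the real data would come from the collected timestamps:

```python
import pandas as pd

# Hypothetical cumulative timings (ms since the button press), one row
# per press, with a timestamp recorded at each hop of the pipeline.
presses = pd.DataFrame({
    "lambda_ms":   [120,  95, 20000, 110, 105],
    "sqs_ms":      [160, 130, 21000, 150, 140],
    "rabbitmq_ms": [210, 180, 21500, 200, 190],
})

# Time spent in each stage: difference between successive hops
# (drop the first column, which has no predecessor).
stage_latency = presses.diff(axis=1).iloc[:, 1:]

# Tail latency of the whole pipeline; the "cold" press dominates it.
p99_ms = presses["rabbitmq_ms"].quantile(0.99)
print(stage_latency.mean())
print(p99_ms)
```

Plotting per-stage latencies like this is what points the finger at the cold-start hop rather than the queue itself.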

------
tannerbrockwell
Implementing monitoring of key metrics is a requirement for establishing SLOs.
A dashboard may not be observed constantly, but if you have an incident, those
key metrics had better be presented in a manner that shows both real-time and
historic performance. Prometheus and Grafana are prevalent because they are
robust and mature implementations. You are correct that dashboards more often
than not are there to show off this capability, but remember: you MUST
implement monitoring if you have SLOs.

Think of the dashboard as the cherry on the cake. There isn't much point to the
cherry, but if you bought a cake, you'd better get a cake!

My biggest complaint about dashboarding is that it is easy to ignore some key
components, such as the percentiles beyond the 95th, which are mostly treated
as a capacity-planning target. If you go looking, you will find serious issues
in serving your 96th-99th percentiles. If you are looking for something to
improve, start there.
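Checking those higher percentiles is a one-line query in PromQL, assuming latency is recorded as a conventional Prometheus histogram (the metric name here is illustrative):

```promql
# 99th percentile request latency over the last 5 minutes; swap 0.99
# for 0.96-0.98 to inspect the usually-ignored tail between p95 and p99.
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```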

------
3minus1
Dashboards are great for correlation. After an alarm fires you check the
dashboard and compare all the graphs at the time of the incident. It's a great
way to get more information about what is and isn't broken.

------
oftenwrong
If your alarms always go out when something is wrong, you do not need a graph.

However, alarms are never perfect. Issues can occur without triggering alarms
at all. The graphs cover your blind spots; they help you debug the issues that
you do not yet have reliable alarms for.

If an issue occurs without triggering an alarm, but the graphs help you debug
the issue, then you should create an alarm that would have caught the issue.
Next time, you will get the alarm, and will have a better idea of what is
happening without having to look at the graphs.

------
2rsf
> if you can have alarms go out that something is wrong, you have that then
> why do you need to see that on a graph?

Because the trend that led to the alarm might be important and only visible on
a graph; for example, when did the queue start growing?

Because you want to fine-tune your alarms, or handle near misses; again, those
are best visible on a graph.

Or because you want to quickly see some high-level behavior of the system:
number of users per hour of the day, errors vs. number of users, etc.
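The "when did the queue start growing?" question can also be asked directly of a Prometheus gauge; the metric and label names here are hypothetical:

```promql
# Per-second slope of queue depth fitted over 15-minute windows; a graph
# of this expression crosses zero at the moment the queue starts growing.
deriv(queue_depth{queue="ingest"}[15m])
```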

