Good and Bad Monitoring (raynorelyp.medium.com)
103 points by kiyanwang on July 13, 2021 | 27 comments



> Bad: HTTP 400
>
> On the other hand, HTTP 400 level errors mean the client screwed up.

This is bad general advice. HTTP 4xx errors mean the client screwed up, OR you screwed up (a change that e.g. increases 404 rate due to eventual consistency, returns 404 for all content, breaks auth, returns the wrong status code, etc.). Either way the content is inaccessible: the person visiting your website doesn't care whether they get a 404 or a 502, they care that the content is inaccessible. Once you get high enough traffic, monitoring 4xx rate is pretty critical to making sure people can actually use your service. (Or monitor the inverse, i.e. a floor on 2xx rate instead of a ceiling on {4,5}xx rate.)
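
Something like this, roughly, for the success-rate floor; it assumes your metrics backend already gives you per-status-code request counts, and the counts and SLO floor below are made up:

    # Sketch: alert when the success-rate floor is breached, instead of
    # keeping separate ceilings on 4xx and 5xx.
    def success_rate(status_counts):
        total = sum(status_counts.values())
        if total == 0:
            return 1.0  # no traffic is not the same as failing traffic
        ok = sum(n for code, n in status_counts.items() if code.startswith("2"))
        return ok / total

    status_counts = {"200": 9420, "404": 310, "500": 12}  # hypothetical window
    if success_rate(status_counts) < 0.99:  # hypothetical SLO floor
        print("page someone: success rate below floor")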


In the context of alerting, I agree with TFA. You should not be alerting on bad request errors, because you might have no control over them. That said, you might want monitoring on them so you can check whether the rate jumped at some important point (e.g., after a deployment), but I wouldn't look at it on a regular basis.

I had something like that on an internal system. The 400 rate would jump all over the place because our edge systems had shitty input validation, and bots would crawl us with broken requests ("can I reserve this item starting last week?" kind of thing) with no rate throttling. After a few years the edge validation (and bot detection) got better, but alerting on that would've been worse than useless.


Yeah, I agree that false positives are a risk with monitoring 4xx rate. I've never seen a satisfactory "bulletproof" way to monitor it; it's inherently difficult and at the same time important to monitor.

It's easy in retrospect to say "oh that was a waste of time because it was just bots" but you don't know that until you investigate. I ask myself "if I see elevated 4xx's, at what point do I start to care if they're caused by a bug?" and set monitor thresholds somewhere around there.


The client who's calling wrong might want some help getting it right.

It's empathetic to keep an eye on surges in 4xxs.


Sure, but there’s a big world between keeping an eye on surges and waking someone up if it goes out of 3-4 nines.


Author here. This.


Absolutely, I used to work for an online bookmaker and if we saw a spike in 404s it usually meant one of our sports traders had stuffed something up and took down a market early, or a release went out that broke our navigation. In a business that is inherently spikey (i.e. the majority of bets came through _just_ before an event started) we had to be pretty careful about what spikes were good and what were bad.


Monitor the general 5xx error rate so you have high SNR.

Cover mistakes with robust probers that should get 200 and then alert on any non-200 response.
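
Roughly this, with only the standard library; the URL and the "alert" are placeholders:

    # Sketch of a prober: request a known-good page and flag any non-200.
    import urllib.error
    import urllib.request

    PROBE_URL = "https://example.com/health"  # placeholder endpoint

    def probe(url):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.status
        except urllib.error.HTTPError as e:
            return e.code  # 4xx/5xx responses land here
        except urllib.error.URLError:
            return 0       # DNS/connection failure: definitely alert

    status = probe(PROBE_URL)
    if status != 200:
        print(f"ALERT: prober got {status} instead of 200")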


This.

In my experience, 4xx and 5xx are only valuable for finding the right place to look, but in no way do they indicate whether the client or the server failed.


Inflated error rates due to invalid 500 errors are definitely a thing; I've seen it at my last two jobs. I think it comes from developers lacking the confidence to change the HTTP status code to something different.

It starts with a genuine 500 caused by an unexpected invalid request the app can't handle, so it crashes or misbehaves in some other way. They put in some exception handling so the app can cope, and now they have code that fixes the bug impacting the server, but they still have to return something for the client's request, which is still not valid. A 400 would almost always be appropriate, or perhaps a more specific 4xx, but 500 is what was already being returned, and anything else is a change which might impact the client in an unexpected way. It takes a lot of confidence to make a potentially breaking change that goes beyond merely fixing a bug, even when you are reasonably certain it's the right change. Once you've returned a 500 after catching an exception a few times in your code base, it starts to set a precedent that others will follow.
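
Once the exception is caught, the fix is usually to classify the bad input as a client error instead of leaving the old 500 in place. A sketch of that, with Flask purely for illustration and made-up names:

    # Sketch: map validation failures to 400; the server handled it fine,
    # the request itself was bad.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    class InvalidRequest(Exception):
        """Raised when the payload fails validation."""

    @app.errorhandler(InvalidRequest)
    def handle_invalid_request(err):
        return jsonify(error=str(err)), 400  # not 500

    @app.post("/reserve")
    def reserve():
        payload = request.get_json(silent=True) or {}
        if "item_id" not in payload:  # hypothetical required field
            raise InvalidRequest("item_id is required")
        return jsonify(status="ok")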


One of the most useful things we did was add a button to every alert we sent that said "Was this alert useful: Yes/No".

We would then send the alert creator reports on what percent of recipients said yes. That alone got people to realize that a lot of their alerts were unnecessary and to get rid of them. As a bonus, the most useful alerts actually got subscriptions from people on other teams because they were such useful indicators.
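
The mechanics were basically a tally like this (the data shapes here are made up):

    # Sketch: aggregate "Was this alert useful?" clicks per alert.
    from collections import defaultdict

    # Hypothetical click log: (alert_name, was_useful)
    clicks = [
        ("checkout-5xx-spike", True),
        ("checkout-5xx-spike", True),
        ("disk-80-percent", False),
        ("disk-80-percent", False),
        ("disk-80-percent", True),
    ]

    tally = defaultdict(lambda: [0, 0])  # alert -> [useful, not useful]
    for alert, useful in clicks:
        tally[alert][0 if useful else 1] += 1

    for alert, (yes, no) in sorted(tally.items()):
        pct = 100 * yes / (yes + no)
        print(f"{alert}: {pct:.0f}% useful across {yes + no} votes")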


I wish this was a standard feature. My company is really bad at alerts, I just have rules to send them to the trash. I have my own i

I am going to implement this if I ever design an alert system.


I actually really like this idea.


> HTTP 400 level errors on their own do not indicate a problem.

...on the back-end, but they can help find issues with the front-end. For example: I have an endpoint my front-end calls in the background to update a preview for the user. Some changes in the front-end made all of those calls invalid (but only in production), and monitoring 400 errors helped me figure out the issue very quickly, before a customer even complained about it. Sometimes "the client screwed up" because of you; if 90% of your clients are messing up in the exact same way, it may actually be your fault.


During an RCA, you find a specific error message associated with that incident. You deliver a new alert with some documentation about what it catches and what to do. You even automatically generate a ticket when it fires. Time passes. There's a subtle change in the error message. You have another production incident, but your alert hasn't fired. The complexity comes from this: how do you know that an alert is still valid without creating an incident on purpose?
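
One mitigation is to treat the match pattern as code and test it against the error the application actually raises, so the test breaks when the message drifts. Rough sketch, with invented names:

    # Sketch: a regression test that fails when the error message no longer
    # matches the pattern the production alert is keyed on.
    import re

    ALERT_PATTERN = re.compile(r"payment gateway timeout after \d+ms")

    def charge_card():
        # Stand-in for the code path that produced the original incident.
        raise RuntimeError("payment gateway timeout after 3000ms")

    def test_alert_still_matches_real_error():
        try:
            charge_card()
        except RuntimeError as err:
            assert ALERT_PATTERN.search(str(err)), (
                "error message changed; the production alert will not fire"
            )

    test_alert_still_matches_real_error()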


The most important thing that was missed:

Good: Alerts on business level metrics

Bad: Alerts on machine level metrics

Knowing that checkout volume is sharply down is far more valuable than knowing that CPU on one of the checkout servers is way up. Mainly because that high CPU may have no customer effect, so it's really not all that urgent.
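
Something like this, comparing the current window to the same window last week; the numbers and the threshold are invented:

    # Sketch: alert when checkout volume drops well below its weekly baseline.
    def checkout_drop_alert(current, same_window_last_week, max_drop=0.5):
        """True if checkouts fell by more than max_drop versus the baseline."""
        if same_window_last_week == 0:
            return False  # no baseline to compare against
        return current < (1 - max_drop) * same_window_last_week

    # Hypothetical 5-minute windows pulled from a metrics store.
    if checkout_drop_alert(current=37, same_window_last_week=140):
        print("page someone: checkouts are sharply down")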


> Good: Alerts on business level metrics

I agree partially. I would make it more general though: alert on symptoms of problems. Those can be business metrics, like the one you suggested in your example, or they can be system level, like rate of errors, or a queue that's growing out of control.

> Bad: Alerts on machine level metrics

100% agree. There is no excuse for that. A CPU working overtime with no customer impact (no symptoms of problems) is an efficient system. I'm paying for that CPU, I'd like to use it. If I get an alert every time I use something I pay for, that will only drive me to pay more so it shuts up, even though there was no problem to begin with.


Why not both? Scraping machine level metrics works out of the box with most agents


It's fine to scrape the metrics, but what I'm saying is don't alert on them by default until you are sure that a particular server metric is actually a good alert.


Yes. And also make business metrics somehow traceable to machine metrics.
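
e.g. emit both kinds of metrics with the same label set so they can be joined later. Rough sketch with prometheus_client and made-up metric names:

    # Sketch: tag business and machine metrics with identical labels so a dip
    # in one can be correlated against the other.
    from prometheus_client import Counter, Gauge

    LABELS = ["service", "version", "host"]

    checkouts_total = Counter("checkouts_total", "Completed checkouts", LABELS)
    cpu_utilization = Gauge("cpu_utilization_ratio",
                            "CPU utilization of the serving process", LABELS)

    labels = dict(service="checkout", version="2021.07.1", host="web-3")
    checkouts_total.labels(**labels).inc()
    cpu_utilization.labels(**labels).set(0.82)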


> I started with a system where another team would publish data to us in an eventing architecture and would frequently publish corrupt data. It was my team’s responsibility to address anytime data was not ingested correctly into our system. As a result, we had floods of errors in our system. We tried asking them to stop and they said no.

In this particular instance, I would simply respond to the caller with an appropriate error code and be on my way. The other team should be responsible for dealing with such an issue. The writer implies they had no choice, but I don't buy it: if you design the system in a way that doesn't allow corrupt data in to begin with, it becomes the caller's responsibility to handle these issues.
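
i.e. validate at the boundary and reject (or dead-letter) the corrupt event with an explicit error instead of ingesting it and erroring later. Sketch, with a placeholder event shape and queue:

    # Sketch: push responsibility for corrupt events back to the producer.
    import json

    dead_letter_queue = []  # placeholder for a real DLQ or error response

    def ingest(raw_event):
        try:
            event = json.loads(raw_event)
            if "order_id" not in event:  # hypothetical required field
                raise ValueError("missing order_id")
        except (json.JSONDecodeError, ValueError) as err:
            # Not our bug: record it against the producer and move on.
            dead_letter_queue.append({"raw": raw_event, "reason": str(err)})
            return False
        # ... normal processing of a valid event ...
        return True

    ingest('{"order_id": 42}')        # accepted
    ingest('{"customer": "broken"}')  # rejected, lands in the DLQ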


If it's like pulling teeth to motivate your users to use dashboards, you're building useless dashboards.

Dashboards are great for an at-a-glance metrics roll-up. Build small, single-page, targeted dashboards that answer questions your users actually ask, and they'll get used. I want to see the lay of the land and know I'm heading into the weeds, instead of being kicked in the shin by a monitoring alert when I'm already in the weeds.


To elaborate on this, dashboards are where you go when you get an alert, to answer questions like:

- how widespread is it?

- who's impacted?

- is the alert the root cause, or just a symptom?

Dashboards should tell a story, not just be a bunch of graphs squeezed onto a page. There should be links to drill-down to more detailed dashboards, logs, and traces, to make it as fast and easy as possible to find the fire when you smell smoke, even for someone who's on their first week.

Most dashboards, unfortunately, are useless. But then those have a place too: hanging on a wall somewhere, to show people how not-useless we are.


Elevated 413 errors could mean that a bug was introduced to the client/frontend that sends large payloads or cookies.

Elevated 400, 401, or 403 errors could mean that a bug was introduced in session or cookie handling middleware, client, or server code.

Elevated 200s could mean a DDoS attack or issues with client-side polling.

Etc..

Alert on status code anomalies, not on the volume/percentage of a certain status code.
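
Rough sketch of "anomaly, not volume": compare each status code's current count to its own recent history and flag the outliers. The data and the z-score cutoff are made up:

    # Sketch: flag status codes that deviate sharply from their own history,
    # instead of using fixed volume thresholds.
    from statistics import mean, stdev

    def anomalous_codes(history, current, z_cutoff=3.0):
        flagged = []
        for code, counts in history.items():
            if len(counts) < 2:
                continue
            mu, sigma = mean(counts), stdev(counts)
            if sigma == 0:
                continue
            if abs((current.get(code, 0) - mu) / sigma) > z_cutoff:
                flagged.append(code)
        return flagged

    history = {"200": [980, 1010, 995, 1002], "401": [3, 5, 4, 6]}  # per minute
    current = {"200": 1001, "401": 240}                             # this minute
    print(anomalous_codes(history, current))  # -> ['401']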


Yes, I much prefer a workflow where alerts that actually need to be looked at just come into an alert-specific Slack channel with some pretty decent basic info. We did it this way at my last job with Datadog/Slack hooks. It was easy to set up and worked great. Staring at dashboards, or even checking them every hour or whatever, makes little sense.


> Eventually, you will add service F and no one will remember to go into the monitoring service and add it, but they will see the tags on the other lambdas and tag the new one correctly.

This matches neither my intuition nor my experience, unless there's automation to check or enforce it.
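
The automation doesn't need to be fancy; something like a nightly check works (this sketch assumes AWS Lambda and boto3, and the tag key is made up):

    # Sketch: fail a scheduled check if any Lambda is missing the monitoring
    # tag, rather than hoping people copy the tag by example.
    import boto3

    REQUIRED_TAG = "monitoring-team"  # hypothetical tag key

    client = boto3.client("lambda")
    missing = []
    for page in client.get_paginator("list_functions").paginate():
        for fn in page["Functions"]:
            tags = client.list_tags(Resource=fn["FunctionArn"]).get("Tags", {})
            if REQUIRED_TAG not in tags:
                missing.append(fn["FunctionName"])

    if missing:
        raise SystemExit(f"Lambdas missing '{REQUIRED_TAG}' tag: {missing}")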


HTTP status codes are like... suggestions at this point.



