
Ask HN: How do you validate your alerts still work? - sethammons
HN folk, how do you ensure the continued integrity of your individual alerts? Just as a DB backup is only valid if you verify you can restore it, an alert is only valid if it will actually alert. I'm thinking regular integration testing that ensures alerts still work is in order, but I'm not sure if anyone does this.

This came front of mind because we recently had a bad deploy where some stuff stopped working right. The correct error logs were generated, but a formerly working alert that checks for those error logs had been poorly modified and as a result no longer worked. Had this alert been in an automated test suite or similar, we would have known ahead of time that the alert had been borked.

HN, how do you handle this kind of situation?
Thanks!
======
davismwfl
We just had something similar earlier this year on one of our systems that
sends alerts to clients: it failed to send them (though in our case the log
showed they were sent).

The solution for us came down to three steps:

1. Make sure unit tests exist for all permutations of failure (network, bad
keys, etc.) and are mandatory, so that all alerts are validated as operational
whenever new code is being created. Add a commit hook if needed. (A minimal
sketch of what these tests can look like is below, after step 3.)

2. We created a new internal demo account. Upon new code being deployed, we
run a partially automated process (someone needs to push a button right now)
that inserts data which triggers specific types of alerts, forcing an alert
to be sent out (this is done in production). We did this because we found
that unit testing covered one scenario, but a number of other scenarios exist
that can be unique to production, e.g. network access, permissions, different
keys, etc.

3. Add daily checks (or more or less often, depending on the need) that all
third parties (even internal ones) your code depends on for alerts are up and
operational. We have one system we now check hourly to ensure it has not
failed; another is only checked weekly because the risk of not getting the
alert is super low and it isn't client facing. We found we had one vendor who
was returning success even though the alert was never delivered (it basically
ate our request). This was caught by the automated checks because, say, 95%
went fine but we would miss an hour here or there -- that allowed us to go
back through the logs and get with the vendor to see wtf.
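
To make step 1 concrete, the tests look roughly like the sketch below
(Python/pytest purely as an illustration; AlertSender and the failure list
are stand-ins for whatever wraps your alert provider, not our real code):

    # test_alert_failures.py - illustrative stand-in, not production code.
    from unittest import mock

    import pytest


    class AlertSender:
        """Toy stand-in for whatever class wraps your alert provider."""

        def _post(self, payload):
            raise NotImplementedError("replace with the real provider call")

        def send(self, channel, message):
            # The bug that bit us: catching the provider error, logging
            # "sent", and returning success anyway. These tests exist to
            # catch exactly that.
            return self._post({"channel": channel, "message": message})


    @pytest.mark.parametrize("failure", [
        ConnectionError("network unreachable"),      # network failure
        PermissionError("bad or expired API key"),   # bad keys
        TimeoutError("provider timed out"),          # slow provider
    ])
    def test_delivery_failures_are_not_swallowed(failure):
        sender = AlertSender()
        with mock.patch.object(sender, "_post", side_effect=failure):
            with pytest.raises(type(failure)):
                sender.send("ops", "disk usage over 90%")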

All of this seems pretty straightforward and obvious, honestly, and we had
most of it in place, but we found it just wasn't robust enough in some cases
and we weren't doing #2 & #3 with enough granularity.

------
gtsteve
We have some manual QA scripts where we introduce error conditions, like
manually killing processes, containers, etc., and make sure the alert is
fired (after notifying admins that we will be testing).

We use Terraform to keep the configurations in sync - most alerting is done
via AWS CloudWatch. This means we can test these conditions in the staging
environment without affecting the production environment.

It's not a very academically satisfying way to do it but it's a process done
every few months and takes about 30 minutes to run through the plan.

One day we'll automate this and perform far more frequent tests, but it's
quite a pragmatic solution for now - our application changes frequently but
the cloud configuration doesn't really.
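
When we do get around to automating it, I expect it will look something like
this sketch (boto3's set_alarm_state forces an alarm into ALARM so the whole
notification path fires, and the alarm falls back to its real state on the
next metric evaluation; the alarm names here are made up):

    # force_alarm_test.py - rough sketch of the eventual automation,
    # run against the staging environment. Alarm names are placeholders.
    import time

    import boto3

    ALARMS_TO_TEST = [
        "staging-api-5xx-rate",
        "staging-worker-heartbeat-missing",
    ]

    cloudwatch = boto3.client("cloudwatch")

    for name in ALARMS_TO_TEST:
        # Force the alarm into ALARM so the full path
        # (alarm -> SNS -> pager/email) actually fires.
        cloudwatch.set_alarm_state(
            AlarmName=name,
            StateValue="ALARM",
            StateReason="Scheduled alert-pipeline test",
        )
        print(f"forced {name} into ALARM - confirm the notification arrived")
        time.sleep(5)  # space the tests out so pages are distinguishable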

------
sethammons
More clarification: in this particular case, it was a Splunk alert that had
been altered. The class of alert I'm thinking about is third-party
monitoring of events/metrics/logs that your service or application emits.
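
The automated check I have in mind would be something along these lines
(rough sketch only; the host, credentials, and saved-search name are
placeholders, and it assumes a canary job has already emitted a known test
error line within the search window):

    # check_splunk_alert.py - rough sketch, placeholders throughout.
    import sys

    import requests

    SPLUNK = "https://splunk.example.com:8089"  # management port
    AUTH = ("svc-alert-check", "xxxxxxxx")
    ALERT_NAME = "prod-payment-errors"

    # 1. Pull the query the alert actually runs today, so edits to the
    #    alert are what get tested, not a stale copy of the query.
    resp = requests.get(
        f"{SPLUNK}/services/saved/searches/{ALERT_NAME}",
        params={"output_mode": "json"},
        auth=AUTH,
    )
    resp.raise_for_status()
    query = resp.json()["entry"][0]["content"]["search"]
    if not query.lstrip().startswith(("search", "|")):
        query = "search " + query

    # 2. Run that query over the window in which the canary error log was
    #    emitted; if it no longer matches, the alert has been borked.
    resp = requests.post(
        f"{SPLUNK}/services/search/jobs",
        data={
            "search": query,
            "exec_mode": "oneshot",
            "earliest_time": "-15m",
            "output_mode": "json",
        },
        auth=AUTH,
    )
    resp.raise_for_status()
    results = resp.json().get("results", [])

    if not results:
        print(f"ALERT CHECK FAILED: '{ALERT_NAME}' no longer matches the test error logs")
        sys.exit(1)
    print(f"ok: '{ALERT_NAME}' matched {len(results)} event(s)")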

