Hacker News

We have alerts set up that expect metrics for things like "orders placed" to always be happening at expected rates.

When Datadog has one of its very rare outages that breaks ingestion, all of our alerts would normally go off because we aren't seeing the expected volume of "orders placed": StatusPage incidents open up for us and our customers, pagers go off, and folks get to work.

But instead, Datadog automatically mutes the false alerts that its outage would otherwise trigger. Saves me a lot of headaches.

Stuff like this is why I'm happy paying the Datadog bills. Even their outages are good.
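The kind of monitor described above can be sketched with Datadog's "Create a monitor" API payload shape. This is a minimal illustration, not the commenter's actual setup: the metric name, tag, threshold, and window are all made up.

```python
# Hypothetical sketch of an "orders placed at expected rates" monitor.
# The payload shape follows Datadog's metric-alert monitor definition
# (POST /api/v1/monitor); metric name, tag, and numbers are invented.

def order_rate_monitor(metric="app.orders.placed",  # hypothetical metric
                       window="last_15m",
                       floor=50):
    """Build a metric-alert definition that fires when order volume is low."""
    return {
        "name": "Orders placed below expected rate",
        "type": "metric alert",
        # Fire when the summed order count over the window drops below `floor`.
        "query": f"sum({window}):sum:{metric}{{env:prod}}.as_count() < {floor}",
        "message": "Order volume is below the expected floor. @pagerduty",
        "options": {
            "thresholds": {"critical": floor},
        },
    }

monitor = order_rate_monitor()
print(monitor["query"])
# → sum(last_15m):sum:app.orders.placed{env:prod}.as_count() < 50
```

It's exactly this class of monitor, one that fires on *absence* of expected data, that Datadog auto-mutes during its own ingestion outages.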




This is convenient behavior up until you actually have an incident that coincides with theirs, in which case it becomes catastrophic because you had no idea that outside vigilance was required on account of their ingestion downtime. Not sure why you would laud this. Is it possible to opt out?


In your scenario, you would have no logs etc. until the DD incident resolved.

Opting out would just mean all your missing data alerts fire every time Datadog has an incident and you would then check, see that everything is missing, and then identify the cause as the Datadog incident.

It's much better to have them handle it and auto-mute the impacted monitors than to communicate to my customers about false alerts saying all our services are down every time Datadog has an incident.


> Opting out would just mean all your missing data alerts fire every time Datadog has an incident and you would then check, see that everything is missing, and then identify the cause as the Datadog incident.

You are missing the last step, which is that, knowing alerts are down, you can actively monitor using other tools/reporting for the duration of their incident.

And why would you have no logs? Even assuming you ingest logs through Datadog (they monitor much more than just logs, and not everyone uses every facet of their offering), you would presumably have some way to access them more directly (even tailing output directly if necessary).

And lastly, why would you communicate to your customers without any idea of the scope or cause of the issue? It would likely become clear very quickly that Datadog was having issues when you see all your metrics suddenly stop with no other ill effects.


>knowing alerts are down, you can actively monitor using other tools/reporting for the duration of their incident.

If you just want notifications for when datadog is down, their StatusPage does a fine job of clearly communicating incidents.

I wouldn't want to rely on a "when multiple of our 'missing business metric' monitors alert, check and see if Datadog is down" step in a runbook. I don't like false alerts. I don't like paging folks about false alerts. Waking up an on-call dev at 2am saying all of production is down, when it is just Datadog, is bad for morale. Alert fatigue is a real and measurable issue with consequences, and avoiding false alerts is good. If the notification says "all of production is down" and that isn't the case, that has real impact. I'd much prefer a lower-severity, lower-communication-level StatusPage alert saying "Datadog ingestion is down".

Instead, use their StatusPage notifications and then execute your plan from that notification, not all of your alerts firing.

>And why would you have no logs?

I mean Datadog logs/metrics, etc. Currently, we are missing everything from them. We can still ssh into things, so the data isn't gone, but from the Datadog monitors' view in this scenario, the logs/metrics stopped arriving, and they would alert if Datadog didn't automatically mute them.

>why would you communicate to your customers without any idea of the scope or cause of the issue?

We prioritize Time To Communicate as a metric. When we notice an issue in production, we want customers to find out from us that we are investigating, rather than encountering the issue themselves, getting mad, and clogging up our support resources. Flaky alerts don't work at all for us here.


IIRC you can also just set up a monitor to alert if there is no data on a given metric.
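Right, Datadog monitors support this via the `notify_no_data` option. A minimal sketch, with an invented metric name and timings (the option names are real, the rest is illustrative):

```python
# Hypothetical "alert if a metric stops reporting" monitor definition.
# notify_no_data / no_data_timeframe are real Datadog monitor options;
# the metric, tag, and timeframe values are made up.
no_data_monitor = {
    "name": "app.orders.placed stopped reporting",
    "type": "metric alert",
    "query": "sum(last_10m):sum:app.orders.placed{env:prod}.as_count() > 0",
    "options": {
        "notify_no_data": True,   # fire when no points arrive at all
        "no_data_timeframe": 20,  # minutes of silence before alerting
    },
}
```

This is the variant Datadog auto-mutes during its own ingestion incidents, since every such monitor would otherwise fire at once.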



