
> Opting out would just mean all your missing data alerts fire every time Datadog has an incident and you would then check, see that everything is missing, and then identify the cause as the Datadog incident.

You are missing the last step, which is that, knowing alerts are down, you can actively monitor using other tools/reporting for the duration of their incident.

And why would you have no logs? Even assuming you ingest logs through Datadog (they monitor much more than just logs, and not everyone uses all facets of their offering), you would presumably have some way to access them more directly (even tailing output directly if necessary).
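For example, something as crude as this works in a pinch (just a sketch; the host name and log path are made up, not anything specific to your setup):

    # Sketch: read application logs directly over SSH while Datadog
    # ingestion is unavailable. Host and log path are illustrative.
    import subprocess

    subprocess.run(
        ["ssh", "app-host-1", "tail", "-f", "/var/log/myapp/app.log"],
        check=False,  # Ctrl-C ends the tail; no need to raise on exit status
    )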

And lastly, why would you communicate to your customers without any idea of the scope or cause of the issue? It would likely become clear very quickly that Datadog was having issues when you see that all your metrics have suddenly stopped with no other ill effects.




>knowing alerts are down, you can actively monitor using other tools/reporting for the duration of their incident.

If you just want notifications for when Datadog is down, their StatusPage does a fine job of clearly communicating incidents.

I wouldn't want to rely on a "when multiple of our 'missing business metric' monitors alert, check and see if Datadog is down" step in a runbook. I don't like false alerts. I don't like paging folks about false alerts. Waking up an on-call dev at 2am saying all of production is down when it is just Datadog is bad for morale. Alert fatigue is a real and measurable issue with consequences, so avoiding false alerts is good. If the notification says "all of production is down" and that isn't the case, there is a cost to that. I'd much prefer a StatusPage alert, at a lower severity and communication level, saying "Datadog ingestion is down".

Instead, use their StatusPage notifications and execute your plan from that notification rather than from all of your alerts firing.
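A rough sketch of what I mean (this assumes Datadog's status page exposes the standard Statuspage JSON API, which you'd want to confirm for your region; StatusPage also supports email/webhook subscriptions, so you may not need to poll at all):

    # Sketch: check Datadog's public status page instead of relying on
    # missing-data monitors. Assumes status.datadoghq.com serves the
    # standard Statuspage /api/v2/status.json endpoint.
    import json
    import urllib.request

    STATUS_URL = "https://status.datadoghq.com/api/v2/status.json"

    with urllib.request.urlopen(STATUS_URL, timeout=10) as resp:
        status = json.load(resp)["status"]

    if status["indicator"] != "none":
        # indicator is "minor", "major", or "critical" -- notify at a lower
        # severity, e.g. post to the on-call channel rather than paging.
        print(f"Datadog reports an incident: {status['description']}")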

>And why would you have no logs?

I mean Datadog logs/metrics etc. Currently, we are missing everything from them. We can still ssh into things, etc.; they aren't gone, but from the Datadog monitors' view in this scenario, they stopped seeing logs/metrics and would alert if Datadog didn't automatically mute them.
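To be concrete, the monitors I'm describing are roughly this shape (a sketch only; the metric name, thresholds, and handle are made up, and notify_no_data / no_data_timeframe are the standard Datadog monitor options as I understand them):

    # Sketch of a "missing business metric" monitor definition, the kind
    # that flips to "no data" across the board during an ingestion outage.
    missing_orders_monitor = {
        "name": "Completed orders metric has gone missing",
        "type": "metric alert",
        "query": "sum(last_10m):sum:orders.completed{env:prod}.as_count() < 1",
        "message": "No completed orders seen in 10 minutes. @pagerduty-orders",
        "options": {
            "notify_no_data": True,   # alert when the metric stops reporting at all
            "no_data_timeframe": 20,  # minutes without data before alerting
        },
    }

When Datadog ingestion itself stops, every monitor like this goes to "no data" at once, which is exactly the flood of false "production is down" pages I want to avoid.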

>why would you communicate to your customers without any idea of the scope or cause of the issue?

We prioritize Time To Communicate as a metric. When we notice an issue in production, we want customers to find out from us that we are investigating, instead of them troubleshooting and encountering the issue themselves, getting mad, and clogging up our support resources. Flaky alerts here don't work at all for us.



