
> Opting out would just mean all your missing data alerts fire every time Datadog has an incident and you would then check, see that everything is missing, and then identify the cause as the Datadog incident.

You are missing the last step, which is that, knowing alerts are down, you can actively monitor using other tools/reporting for the duration of their incident.

And why would you have no logs? Even assuming you ingest logs through Datadog (they monitor much more than just logs, and not everyone uses all facets of their offering), you would presumably have some way to access them more directly (even tailing output directly if necessary).
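For example, something as crude as this works in a pinch (just a sketch; the host name and log path are made up, not anything specific to your setup):

    # Sketch: read application logs directly over SSH while Datadog
    # ingestion is unavailable. Host and log path are illustrative.
    import subprocess

    subprocess.run(
        ["ssh", "app-host-1", "tail", "-f", "/var/log/myapp/app.log"],
        check=False,  # Ctrl-C ends the tail; no need to raise on exit status
    )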

And lastly, why would you communicate to your customers without any idea of the scope or cause of the issue? It would likely become clear very quickly that Datadog was having issues when you see that all your metrics have suddenly stopped with no other ill effects.




>knowing alerts are down, you can actively monitor using other tools/reporting for the duration of their incident.

If you just want notifications for when Datadog is down, their StatusPage does a fine job of clearly communicating incidents.

I wouldn't want to rely on a "when multiple of our 'missing business metric' monitors alert, check and see if Datadog is down" step in a runbook. I don't like false alerts. I don't like paging folks about false alerts. Waking up an on-call dev at 2am saying all of production is down when it is just Datadog is bad for morale. Alert fatigue is a real and measurable issue with consequences, so avoiding false alerts is good. If the notification says "all of production is down" and that isn't the case, there is a cost to that. I'd much prefer a StatusPage alert, at a lower severity and communication level, saying "Datadog ingestion is down".

Instead, use their StatusPage notifications and execute your plan from that notification rather than from all of your alerts firing.
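A rough sketch of what I mean (this assumes Datadog's status page exposes the standard Statuspage JSON API, which you'd want to confirm for your region; StatusPage also supports email/webhook subscriptions, so you may not need to poll at all):

    # Sketch: check Datadog's public status page instead of relying on
    # missing-data monitors. Assumes status.datadoghq.com serves the
    # standard Statuspage /api/v2/status.json endpoint.
    import json
    import urllib.request

    STATUS_URL = "https://status.datadoghq.com/api/v2/status.json"

    with urllib.request.urlopen(STATUS_URL, timeout=10) as resp:
        status = json.load(resp)["status"]

    if status["indicator"] != "none":
        # indicator is "minor", "major", or "critical" -- notify at a lower
        # severity, e.g. post to the on-call channel rather than paging.
        print(f"Datadog reports an incident: {status['description']}")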

>And why would you have no logs?

I mean Datadog logs/metrics etc. Currently, we are missing everything from them. We can still ssh into things, etc.; they aren't gone, but from the Datadog monitors' view in this scenario, they stopped seeing logs/metrics and would alert if Datadog didn't automatically mute them.
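To be concrete, the monitors I'm describing are roughly this shape (a sketch only; the metric name, thresholds, and handle are made up, and notify_no_data / no_data_timeframe are the standard Datadog monitor options as I understand them):

    # Sketch of a "missing business metric" monitor definition, the kind
    # that flips to "no data" across the board during an ingestion outage.
    missing_orders_monitor = {
        "name": "Completed orders metric has gone missing",
        "type": "metric alert",
        "query": "sum(last_10m):sum:orders.completed{env:prod}.as_count() < 1",
        "message": "No completed orders seen in 10 minutes. @pagerduty-orders",
        "options": {
            "notify_no_data": True,   # alert when the metric stops reporting at all
            "no_data_timeframe": 20,  # minutes without data before alerting
        },
    }

When Datadog ingestion itself stops, every monitor like this goes to "no data" at once, which is exactly the flood of false "production is down" pages I want to avoid.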

>why would you communicate to your customers without any idea of the scope or cause of the issue?

We prioritize Time To Communicate as a metric. When we notice an issue in production, we want customers to find out from us that we are investigating, instead of them troubleshooting and encountering the issue themselves, getting mad, and clogging up our support resources. Flaky alerts here don't work at all for us.



