The approach at Google is to report the actual error rate up to the monitoring system, and then let the monitoring system decide at what threshold to fire a warning. This lets you distinguish a wide variety of failure modes: a single replica with a high error rate is probably a wildly different problem from a whole rack of machines with a high error rate, which is different from every machine in the service having a high error rate, which is different again from a high error rate only on the set of machines that were fed a specific query.
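Google's internal monitoring stack isn't public, but the same pattern is easy to sketch with the Prometheus Go client: the serving binary exports raw counters labeled by replica and rack (both hypothetical label names here, as is the metric naming), and the alerting rules, not the binary, decide what error rate is worth waking someone up for.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestErrors counts failed requests, labeled so the monitoring
// system can slice the error rate by replica or by rack and alert
// differently depending on the blast radius.
var requestErrors = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "myservice_request_errors_total",
		Help: "Requests that failed, by replica and rack.",
	},
	[]string{"replica", "rack"},
)

var requestsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "myservice_requests_total",
		Help: "All requests, by replica and rack.",
	},
	[]string{"replica", "rack"},
)

func init() {
	prometheus.MustRegister(requestErrors, requestsTotal)
}

func recordRequest(replica, rack string, err error) {
	requestsTotal.WithLabelValues(replica, rack).Inc()
	if err != nil {
		// Report the raw count; thresholds live in the alerting
		// rules, not in the serving binary.
		requestErrors.WithLabelValues(replica, rack).Inc()
	}
}

func main() {
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

An alerting rule can then compute something like `rate(myservice_request_errors_total[5m])` aggregated by rack versus by replica, and fire a different alert for each shape of problem.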
One of the bugs in this postmortem was that the process in question didn't do this, instead masking the error. Somewhat understandable, as I found the whole "execute a fallback, report the failure, and let the monitoring rules deal with it" philosophy one of the most confusing parts of being a Noogler. If you've never worked on distributed systems before, the idea that there is a monitoring system is a strange concept.
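The bug here is the opposite pattern: swallow the error, serve the fallback, and leave monitoring blind. Here's a hedged sketch of the intended "execute a fallback, report the failure" shape, with `fetchFromPrimary` and `fetchFromCache` as stand-ins for whatever the real process actually did:

```go
package fetch

import "github.com/prometheus/client_golang/prometheus"

// primaryErrors counts primary-backend failures even when a fallback
// succeeds, so the monitoring system still sees the true error rate.
var primaryErrors = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "myservice_primary_errors_total",
		Help: "Primary backend failures, by operation.",
	},
	[]string{"op"},
)

func init() { prometheus.MustRegister(primaryErrors) }

// fetchFromPrimary and fetchFromCache are hypothetical stand-ins for
// whatever the real process called.
func fetchFromPrimary(key string) (string, error) { /* ... */ return "", nil }
func fetchFromCache(key string) (string, error)   { /* ... */ return "", nil }

// fetchValue executes the fallback but still reports the failure,
// instead of masking it the way the buggy process did.
func fetchValue(key string) (string, error) {
	val, err := fetchFromPrimary(key)
	if err == nil {
		return val, nil
	}
	// The caller gets a successful response from the fallback, but
	// the failure is still visible to the alerting rules.
	primaryErrors.WithLabelValues("fetch").Inc()
	return fetchFromCache(key)
}
```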