Hacker News

The approach at Google is to report the actual error rate up to the monitoring system, and then let the monitoring system decide at what threshold to alert with a warning message. This lets you catch a wide variety of errors, e.g. a single replica with a high error rate is probably a wildly different problem from a whole rack of machines with a high error rate, which is different from every machine in the service having a high error rate, which is different from only the set of machines that were fed a specific query having a high error rate.
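A minimal sketch of that idea, assuming nothing about Google's actual monitoring stack: each replica reports raw error counts tagged with its identity, and the alerting side decides which scope (service, rack, or single replica) the elevated error rate belongs to. All names, the threshold, and the aggregation order here are my own.

```python
from collections import defaultdict


class Metrics:
    """Hypothetical in-memory metrics registry: (replica, rack) -> [errors, requests]."""

    def __init__(self):
        self.counts = defaultdict(lambda: [0, 0])

    def record(self, replica, rack, ok):
        c = self.counts[(replica, rack)]
        c[1] += 1
        if not ok:
            c[0] += 1


def alert_scope(metrics, threshold=0.05):
    """Return the broadest scope whose members all exceed the error-rate threshold.

    The point is that the replicas only report raw counts; the decision of
    whether (and at what granularity) to alert lives entirely on this side.
    """
    bad = {k for k, (err, req) in metrics.counts.items() if req and err / req > threshold}
    if not bad:
        return None
    all_keys = set(metrics.counts)
    if bad == all_keys:
        return "service"  # every replica is unhealthy
    for rack in sorted({rack for _, rack in all_keys}):
        members = {k for k in all_keys if k[1] == rack}
        if members <= bad:
            return f"rack:{rack}"  # the whole rack is unhealthy
    replica = sorted(bad)[0][0]
    return f"replica:{replica}"  # an isolated replica is unhealthy
```

For example, if one replica serves 50% errors while its rack-mates are healthy, `alert_scope` reports a replica-level problem; once every replica in the rack goes bad, the same data rolls up to a rack-level alert instead.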

One of the bugs in this postmortem was that the process in question didn't do this, and instead masked the error. Somewhat understandable: I found the whole "execute a fallback, report the failure, and let the monitoring rules deal with it" philosophy to be one of the most confusing parts of being a Noogler. If you've never worked on distributed systems before, the idea that there is a separate monitoring system at all is a strange concept.
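That "fall back, but still report" pattern can be sketched in a few lines. This is my own illustration, not code from the postmortem; the function names and the counter dict are hypothetical. The key point is that the `except` branch both serves the fallback and records the failure, rather than silently swallowing it.

```python
import logging


def fetch_config(primary, fallback, error_counter):
    """Serve a fallback on failure, but never mask the error.

    `primary` and `fallback` are zero-argument callables; `error_counter`
    is a dict standing in for whatever metric the monitoring system scrapes.
    """
    try:
        return primary()
    except Exception:
        # Report the failure so monitoring rules can decide if it's alert-worthy,
        # *then* degrade gracefully. Masking the error here was the bug.
        error_counter["fetch_config_failures"] += 1
        logging.exception("primary fetch failed; serving fallback")
        return fallback()
```

The caller still gets a usable result either way; the difference is that the monitoring system sees the true failure rate instead of a process that looks healthy while quietly running on fallbacks.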



