At scale, rare events start to happen reliably. Hardware failures will almost certainly cause ERROR conditions. So will network glitches.
Our production system pages oncall for any error. At night it will only wake somebody up for a whole bunch of errors. This discipline forces us to look at every ERROR and decide whether it is spurious and out of our control or something we can deal with. At some point our production system will reach a scale where errors are logged constantly and this strategy won't make sense any more. But for now it helps keep our system clean.
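For what it's worth, the policy is roughly the sketch below. The quiet hours, the threshold, and page_oncall are made-up stand-ins for illustration, not our actual setup:

```python
from collections import deque
from datetime import datetime, timedelta

# Assumed values: what counts as "a whole bunch" of errors, and over what window.
NIGHT_ERROR_THRESHOLD = 20
WINDOW = timedelta(minutes=10)

recent_errors: deque = deque()

def is_night(now: datetime) -> bool:
    # Quiet hours: 22:00 to 07:00 (assumed).
    return now.hour >= 22 or now.hour < 7

def on_error_logged(now: datetime, page_oncall) -> None:
    """Called once per ERROR-level log line."""
    recent_errors.append(now)
    # Keep only errors inside the sliding window.
    while recent_errors and now - recent_errors[0] > WINDOW:
        recent_errors.popleft()
    if not is_night(now):
        page_oncall("ERROR logged")                    # daytime: every error pages
    elif len(recent_errors) >= NIGHT_ERROR_THRESHOLD:
        page_oncall("error burst during quiet hours")  # night: only a burst pages
```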
I think if someone is going to be gotten out of bed, that should be a CRITICAL rather than an ERROR. Generally I'd say in a large "live" system, errors end up raising Jira tickets and criticals end up ringing phones.
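Something like this split, roughly - where create_jira_ticket and ring_phone are hypothetical stand-ins for whatever ticketing and paging integrations you actually have:

```python
import logging

def route_alert(record: logging.LogRecord, create_jira_ticket, ring_phone) -> None:
    # CRITICAL and above rings phones; a plain ERROR just files a ticket.
    if record.levelno >= logging.CRITICAL:
        ring_phone(record.getMessage())
    elif record.levelno >= logging.ERROR:
        create_jira_ticket(record.getMessage())
```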
Most systems I’ve worked with can go completely offline without ever logging a critical error. Some coding error, misconfiguration, or failure in a critical system - enough to log an error - and nobody can get any useful work done. I’ve never seen anything that could convert those into critical errors. I’m used to critical errors being rare - certain failures of a server to start, or infra problems.