I was thinking this would be a cool area of research for me to try programming again, but it seems so daunting I am not sure where to start.
As a software developer, I generally use log levels to indicate severity in my logs. So grepping for ERROR should catch anything I had the foresight to log at the ERROR level.
Simple heuristics, like the number of WARN-level logs per minute, may be useful.
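Not the original poster, but here is a minimal sketch of that kind of heuristic in Python, assuming a "YYYY-MM-DD HH:MM:SS ... WARN ..." line format; the log path and the 50-per-minute threshold are made up for the example:

    #!/usr/bin/env python3
    # Flag minutes with an unusual number of WARN lines.
    import re
    from collections import Counter

    LOGFILE = "/var/log/myapp.log"   # hypothetical path
    THRESHOLD = 50                   # WARN lines per minute before we complain

    # Capture the timestamp down to the minute on lines containing WARN.
    stamp = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}.*\bWARN\b")
    warn_per_minute = Counter()

    with open(LOGFILE, errors="replace") as fh:
        for line in fh:
            m = stamp.match(line)
            if m:
                warn_per_minute[m.group(1)] += 1

    for minute, count in sorted(warn_per_minute.items()):
        if count > THRESHOLD:
            print(f"{minute}: {count} WARN lines (over {THRESHOLD}/min)")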
Beyond that it sounds interesting. It may be hard to do in a general way, so focusing on Apache logs or something common may be a simpler task.
Very cool stuff. Do you use it?
When I say too much overhead, I'm referring to the carbon proxy and Redis requirements. We found that just using the JSON output from Graphite was sufficient to feed a trend monitoring system.
The output is pretty sensitive, more so than Icinga2 (Nagios) expects, so we had to turn down a few of the "is this really down" re-checks, since they would silence legitimate trend alerts.
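For anyone wondering what feeding a trend check from Graphite's JSON output can look like, here is a rough sketch against the render API (format=json); the host, metric target, and the 2x threshold are placeholders, not details from the parent's setup:

    #!/usr/bin/env python3
    # Pull recent datapoints from Graphite and do a naive trend comparison.
    import json
    from urllib.parse import urlencode
    from urllib.request import urlopen

    GRAPHITE = "http://graphite.example.com"   # hypothetical host
    TARGET = "stats.web.error_rate"            # hypothetical metric

    params = urlencode({"target": TARGET, "from": "-30min", "format": "json"})
    with urlopen(f"{GRAPHITE}/render?{params}") as resp:
        series = json.load(resp)

    for s in series:
        # datapoints are [value, timestamp] pairs; value is null for gaps
        values = [v for v, _ in s["datapoints"] if v is not None]
        if not values:
            continue
        recent = sum(values[-5:]) / len(values[-5:])
        overall = sum(values) / len(values)
        if overall and recent > 2 * overall:
            print(f"{s['target']}: recent avg {recent:.1f} vs {overall:.1f} overall")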
It emails me any log entries it doesn't know about. I did have to add a large number of ssh lines that it should not bother me about, but other than that it works very well and I find it very useful.
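That sort of tool is essentially a pile of ignore regexps plus a mail step. A toy version in Python, with invented patterns, paths, and addresses:

    #!/usr/bin/env python3
    # Mail any log line that matches none of the ignore rules.
    import re
    import smtplib
    from email.message import EmailMessage

    IGNORE = [re.compile(p) for p in (
        r"sshd\[\d+\]: Accepted publickey for \w+",   # routine ssh logins
        r"CRON\[\d+\]: pam_unix",                     # cron session noise
    )]

    with open("/var/log/auth.log", errors="replace") as fh:
        unknown = [line for line in fh
                   if not any(rx.search(line) for rx in IGNORE)]

    if unknown:
        msg = EmailMessage()
        msg["Subject"] = f"{len(unknown)} unrecognised log entries"
        msg["From"] = "logcheck@example.com"
        msg["To"] = "admin@example.com"
        msg.set_content("".join(unknown))
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)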
So you can use it for other purposes (such as mailing an admin if your server suddenly starts sending 500 errors, or an unusual number of 404 errors, for instance).
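A bare-bones version of that 500/404 check against an Apache-style access log might look like this; the path and thresholds are invented:

    #!/usr/bin/env python3
    # Warn when an access log shows too many 5xx or 404 responses.
    import re
    from collections import Counter

    ACCESS_LOG = "/var/log/apache2/access.log"   # hypothetical path
    LIMITS = {"5xx": 10, "404": 100}             # hypothetical thresholds

    # in combined log format the status code follows the quoted request
    status_re = re.compile(r'" (\d{3}) ')
    counts = Counter()

    with open(ACCESS_LOG, errors="replace") as fh:
        for line in fh:
            m = status_re.search(line)
            if not m:
                continue
            code = m.group(1)
            if code.startswith("5"):
                counts["5xx"] += 1
            elif code == "404":
                counts["404"] += 1

    for key, limit in LIMITS.items():
        if counts[key] > limit:
            print(f"warning: {counts[key]} {key} responses (limit {limit})")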
I like fail2ban a lot, and alternatives in that field, but when I looked at the Arch Linux package last time there were dozens of regexp template files like you describe, commented out but still heavily annotated. I think this would be a neat machine learning thing.
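As a crude statistical stand-in for that machine-learning idea, you can get surprisingly far just by normalising lines into templates and flagging the ones you have rarely seen; a sketch, with a made-up path and cutoff:

    #!/usr/bin/env python3
    # Template-based novelty detection: mask variable fields, count templates,
    # and report the templates that show up only a handful of times.
    import re
    from collections import Counter

    LOGFILE = "/var/log/syslog"   # hypothetical path
    RARE = 3                      # report templates seen fewer than this

    def template(line: str) -> str:
        line = re.sub(r"\b\d+\.\d+\.\d+\.\d+\b", "<ip>", line)   # IPv4 addresses
        line = re.sub(r"\b0x[0-9a-fA-F]+\b", "<hex>", line)      # hex values
        line = re.sub(r"\d+", "<n>", line)                        # other numbers
        return line.strip()

    with open(LOGFILE, errors="replace") as fh:
        counts = Counter(template(line) for line in fh)

    for tpl, n in counts.items():
        if n < RARE:
            print(f"rare ({n}x): {tpl}")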
What I am going for: train an AI to act as a passive, entry-level sysadmin that warns you.