
Mute uninteresting log noise with machine learning - matryer
https://blog.machinebox.io/mute-uninteresting-log-noise-with-machine-learning-daa19f6e222
======
WiseWeasel
Today's uninteresting log noise is tomorrow's critical data.

I've been loving Kibana for filtering and reporting on log data in flexible
and insightful ways, including automatically generated charts for certain data
sources.

~~~
sebcat
Yes. If you used log levels/priorities, facilities and identities in a sane
way, your logs would already be classified.

Let's say there's a service failure and I want to know what the service has
done prior to the failure. I wouldn't want a classifier to filter the logs in
that case, so that use case is out of the picture. What other use cases than
filtering are there for this? Maybe as a way to provide feedback to developers
to fix the log messages, as in "this thing that we log all the time can be
determined to never affect the process of trouble-shooting our services, and
the classifier thinks it's noise, so we'll remove it".

------
peterevans
It would be neat if MachineBox could sense whether log noise would be useful
in other contexts--e.g., as a metric that can be graphed. Or whether your
logging is lacking something that might be useful, or just lacking signal at
all (hey, user, your logs are _just_ noise!).

------
bpchaps
One of the ways that I do this (assuming you have access to unix utilities) is
to do:

    cat output.log | tr -d '0-9' | sort | uniq -c | sort -n
This is a fairly useful way of removing relatively useless information such as
timestamps and line numbers when you're looking for rare or unique events. The
alternative, I think, is to do a bunch of awk or sed magic, which isn't really
fun for anybody. It's especially useful in a time crunch when there's an
ongoing outage.
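The same idea translates directly to Python when unix utilities aren't at
hand. This is a minimal sketch (the function name `rare_lines` is made up for
illustration): strip digits so timestamps and line numbers collapse into a
shared template, then count and sort, mirroring `sort | uniq -c | sort -n`.

```python
import re
from collections import Counter

def rare_lines(lines):
    """Collapse digits so timestamps/line numbers don't make every line unique,
    then count the resulting templates, rarest first."""
    counts = Counter(re.sub(r"[0-9]", "", line) for line in lines)
    # Rarest templates first, like `sort | uniq -c | sort -n`
    return sorted(counts.items(), key=lambda kv: kv[1])

log = [
    "2023-01-01 12:00:01 INFO heartbeat ok",
    "2023-01-01 12:00:02 INFO heartbeat ok",
    "2023-01-01 12:00:03 ERROR disk full on /dev/sda",
]
for count, template in ((n, t) for t, n in rare_lines(log)):
    print(count, template)
```

The rare events surface at the top of the output, with the high-volume
heartbeat noise pushed to the bottom.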

~~~
_ZeD_
Honestly, _I_ find it really fun to do "a bunch of awk or sed magic"

~~~
bpchaps
Except for us weirdos :p

------
lopmotr
Is it possible to make an ML algorithm which has only "noise" data for
training and then identifies abnormalities? It seems like people do that
easily, and it would be ideal for an application like this, where you might
not have much training data on all the "not noise" types of examples.

Another application would be a security camera that detects unusual events
without having to train it on actual burglars.

~~~
slashcom
This is a subfield called anomaly detection.
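A minimal sketch of that idea applied to logs, with no ML library: train only
on normal ("noise") lines, and flag anything whose message template was never
or rarely seen during training. The class name `NoiseModel` and the templating
scheme are illustrative assumptions, not a reference implementation.

```python
import re
from collections import Counter

def template(line):
    # Collapse runs of digits so variable fields (timestamps, IDs) share a template
    return re.sub(r"[0-9]+", "<N>", line)

class NoiseModel:
    """Train on normal ('noise') logs only; score unseen templates as anomalous."""
    def fit(self, lines):
        self.counts = Counter(template(l) for l in lines)
        self.total = sum(self.counts.values())
        return self

    def is_anomalous(self, line, min_freq=0.01):
        # Anomalous if the template was never seen, or seen very rarely, in training
        freq = self.counts.get(template(line), 0) / max(self.total, 1)
        return freq < min_freq

normal = ["heartbeat ok 12:00:01"] * 99 + ["gc pause 12ms"]
model = NoiseModel().fit(normal)
print(model.is_anomalous("heartbeat ok 12:00:05"))  # seen template: False
print(model.is_anomalous("disk failure on sda"))    # unseen template: True
```

This never needs labeled "not noise" examples, which is exactly the appeal:
anything sufficiently unlike the training distribution gets surfaced.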

------
kthielen
Maybe an easier way to go is to record it structured up front (it's already
structured in the original application source anyway). This makes it much
easier to record efficiently (so you can record more data) and also much
easier to query efficiently, where, e.g., you might invest time in machine
learning on logical data instead of having to mess around with text.

That’s what we do here anyway, it’s worked well for us:

[https://github.com/Morgan-Stanley/hobbes/blob/master/README....](https://github.com/Morgan-Stanley/hobbes/blob/master/README.md#storage)
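For a sense of what "structured up front" can look like without hobbes, here
is a sketch using Python's standard `logging` module to emit one JSON object
per record instead of free text. The field names `order_id` and `latency_ms`
are made-up examples, not anything from the linked project.

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object instead of free text."""
    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Structured fields passed via `extra=` land as attributes on the record
        for key in ("order_id", "latency_ms"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("orders")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("order placed", extra={"order_id": 42, "latency_ms": 7})
```

Downstream tools (or a classifier) can then filter on `level` or aggregate
`latency_ms` directly, with no text parsing step at all.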

------
foo101
Wouldn't a false negative (a critical log being muted) be a major concern when
using machine learning in this domain?

What if I never see a critical log because the trained model decided that it
is unimportant? How is such a situation generally solved in the industry?

~~~
sannee
I have limited experience, but I think that usually you would take this into
account when building your loss function and heavily penalize false negatives
during training.

------
arbie
It would be nice if logfile analysis tools (including ELK) supported logs that
were multiple lines per message. Does anyone know of such tools?
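In the Elastic stack this is commonly handled at ingest time with Filebeat's
multiline settings, which join continuation lines (stack traces, wrapped
messages) into one event before they reach Elasticsearch. A sketch, assuming
lines in each message after the first don't start with a timestamp; check the
options against your Filebeat version's docs:

```yaml
filebeat.inputs:
  - type: log
    paths:
      - /var/log/app/*.log
    # Any line NOT starting with a date is a continuation of the previous event
    multiline.pattern: '^\d{4}-\d{2}-\d{2}'
    multiline.negate: true
    multiline.match: after
```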

------
vinchuco
I really wanted this to be about real time sound editing and not about log
data.

------
matsucks
Awesome

