
DeepLog: Anomaly Detection and Diagnosis from System Logs (2017) [pdf] - StreamBright
https://acmccs.github.io/papers/p1285-duA.pdf
======
Eridrus
I am skeptical of anomaly detection since in my experience anomalies are
common and diverse and don't actually matter, so I expect these systems to
basically inundate people with false positives.

Their offline-training precision is garbage: 16%, so all of the real work is
basically being done in the online training portion, which gets it to a
respectable 82%+ precision.

But they don't tell you how many alerts they had to label to get those
numbers. Maybe over the long run you get those numbers, but you really want to
know if it takes 10 or 10,000 examples to get there.

Also, their dataset distribution is very different from reality: about 7% of
their dataset is annotated as real anomalies. I don't think anyone in the real
world wants that large a share of their log entries flagged as anomalies, so I
expect their precision numbers to be far worse on more realistically
distributed logs.
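The base-rate effect behind that worry is easy to make concrete. A minimal sketch, with hypothetical detector numbers (not figures from the paper): holding the detector fixed, precision collapses as anomalies get rarer.

```python
def precision(tpr, fpr, prevalence):
    """Precision of a fixed detector (true/false positive rates)
    at a given anomaly base rate, via Bayes' rule."""
    tp = tpr * prevalence
    fp = fpr * (1.0 - prevalence)
    return tp / (tp + fp)

# Illustrative numbers only: the same detector at a 7% anomaly rate
# vs. a more realistic 0.1% rate.
p_high_base_rate = precision(tpr=0.9, fpr=0.02, prevalence=0.07)   # ~0.77
p_low_base_rate  = precision(tpr=0.9, fpr=0.02, prevalence=0.001)  # ~0.04
```

Even a detector with a 2% false-positive rate ends up with almost all-false alerts once anomalies are rare enough.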

~~~
jstarfish
It's a good time to make some money baffling people with bullshit in the
cybersecurity space.

Of course if you let an ML-powered "anomaly detection" engine run rampant on
your logs, it's going to find anomalies...just like if you hire a ghost
hunter, you'll be informed that your house is haunted. In the end, ghost
chasing is all this anomaly nonsense turns out to be-- the justifications for
conclusions by ML practitioners and ghost hunters alike tend to be equally
mumbly and hand-wavy.

Me working from home is technically an anomaly, and one these systems are all
too eager to flag. We get random logins from overseas VPSes-- it's an anomaly!
Oh, wait, no, we onboarded a client application. Oh, look, a random login from
China for a US-based employee with no history of foreign logins! Yeah, that
guy just started in a new position with travel requirements. Hey, this IP just
tried to log into 5000 user accounts! Congratulations, you just alerted me to
the existence of carrier NAT.

None of this saves any time and usually wastes it, since it stirs up paranoia
where none was otherwise warranted. It's a fun toy that gives the appearance
of being productive when all it's actually doing is generating literally
endless busywork. Good for justifying your SOC budget I suppose.

But in the end nobody wants to pay a quarter-million dollars for a black box
that just sits there quietly-- if it's not constantly drawing attention to
itself and all the badness it's pretending to find, you're not going to have
any reason to renew the license.

"Renew it? Why? This thing didn't find anything at all last year."

~~~
noir-york
So what does your organisation use for intrusion detection? Humans eyeballing
logs doesn't scale. Rule-based approaches?

~~~
russh
Mostly user complaints...

------
pilooch
I do, with others, a lot of ML anomaly detection in the cybersecurity
context. DeepLog has interesting ideas, especially the log encoding via LSTM.
The work was presented at a workshop at NIPS 2017.

One of the interesting facts we've been able to measure empirically over the
past few years is that the magnitude of a statistical anomaly score (e.g.
reconstruction error) is uncorrelated with the criticality of the anomaly in
security/threat terms.

This means that in practice SOC operators need to label on top of the anomaly
detection, and after a while a supervised model can do the reranking.
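That workflow (an unsupervised detector surfaces candidates, operators label them, a supervised model reranks) can be sketched with a stand-in model. Everything here is hypothetical: the features, the labels, and the choice of a tiny logistic regression as the reranker.

```python
import math

def train_logreg(X, y, lr=0.5, epochs=500):
    """Tiny logistic regression via SGD, trained on operator labels
    (1 = security-relevant, 0 = benign anomaly)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            w = [wj - lr * (p - yi) * xj for wj, xj in zip(w, xi)]
            b -= lr * (p - yi)
    return w, b

def score(w, b, x):
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical candidates surfaced by the unsupervised detector,
# features = (reconstruction_error, touches_sensitive_host):
labeled = [
    ((0.9, 0.0), 0),  # huge error, harmless (e.g. a log format change)
    ((0.8, 0.0), 0),
    ((0.3, 1.0), 1),  # modest error, but on a sensitive host
    ((0.2, 1.0), 1),
]
w, b = train_logreg([f for f, _ in labeled], [l for _, l in labeled])

# The reranker can invert the raw anomaly-score ordering:
candidates = [(0.95, 0.0), (0.25, 1.0)]
reranked = sorted(candidates, key=lambda x: score(w, b, x), reverse=True)
```

The point of the toy example is exactly the uncorrelatedness above: the highest reconstruction error is not the most critical finding once labels are taken into account.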

------
thaumaturgy
This is an interesting paper, but it sort of sidesteps one of the harder
problems in generalized machine learning for log analysis:

> As shown by several prior work [9, 22, 39, 42, 45], an effective methodology
> is to extract a “log key” (also known as “message type”) from each log
> entry. The log key of a log entry e refers to the string constant k from the
> print statement in the source code which printed e during the execution of
> that code.

So if you're looking for a way to apply this to log data that varies wildly,
like site access logs, you still have the difficult problem of converting the
URIs to the numeric vectors needed by ML algorithms without losing the
significant parts of the input.
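One common (and lossy) workaround is the hashing trick: hash URI tokens into a fixed-length count vector with no global vocabulary. A sketch, where the tokenization and dimension are arbitrary choices; note that variable fields like numeric IDs still hash like any other token, which is exactly the information-loss problem described above.

```python
import hashlib

def uri_to_vector(uri, dim=16):
    """Hash URI path segments and query keys into a fixed-length
    count vector, with no global vocabulary needed."""
    path, _, query = uri.partition("?")
    tokens = [seg for seg in path.split("/") if seg]
    tokens += [kv.split("=")[0] for kv in query.split("&") if kv]
    vec = [0] * dim
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        vec[h % dim] += 1  # "42" hashes like any other token, so
    return vec             # significant structure can be lost

v = uri_to_vector("/api/v1/users/42/orders?page=2&sort=desc")
```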

------
asavinov
Here is another generic approach to anomaly detection from event data which
has been used for analyzing logs received from automatic lawn mowers:

[https://www.researchgate.net/publication/323971244_Detecting...](https://www.researchgate.net/publication/323971244_Detecting_Anomalies_in_Device_Event_Data_in_the_IoT)

It allows for using different algorithms like one-class SVM or MDS (including
custom algorithms). It also allows for defining custom domain-specific
features as an integral part of its analysis engine. In particular, for log
analysis, frequencies of various event types were generated as features.
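The event-type-frequency features mentioned can be sketched as a sliding window over the event stream (the mower event names below are made up):

```python
from collections import Counter

def frequency_features(events, window=5):
    """Per sliding window, the relative frequency of each event type."""
    feats = []
    for i in range(len(events) - window + 1):
        counts = Counter(events[i:i + window])
        feats.append({t: c / window for t, c in counts.items()})
    return feats

# Hypothetical mower event stream: a burst of ERROR events shows up
# directly in the frequency features of the later windows.
stream = ["START", "MOW", "MOW", "MOW", "DOCK",
          "START", "ERROR", "ERROR", "ERROR", "DOCK"]
feats = frequency_features(stream)
```

Any downstream detector (one-class SVM or otherwise) can then operate on these fixed-length frequency vectors.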

------
kthielen
Once you discard type structure, it's a fruitless task to try to reconstruct
it.

It's much easier to make sense of logs when we don't discard that type
structure.

[https://github.com/Morgan-Stanley/hobbes#storage](https://github.com/Morgan-Stanley/hobbes#storage)

~~~
StreamBright
Would you mind explaining what the primary use of Hobbes is?

------
corneliu_p
Here are some slides: [https://github.com/charles-typ/DeepLog-instroduction](https://github.com/charles-typ/DeepLog-instroduction)

------
lindig
The authors are using their own Spell [1] tool to parse syslog files into
patterns that represent the fixed part of a printf-like log statement. Is the
source of that available? At the heart of it is a tree-based construction
that is not well explained.

[1]
[https://www.cs.utah.edu/~lifeifei/papers/spell.pdf](https://www.cs.utah.edu/~lifeifei/papers/spell.pdf)
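Per the linked paper, Spell's core operation is longest-common-subsequence matching between a new entry and existing templates: LCS tokens are kept, everything else becomes a variable field. A toy sketch of just that step (without the prefix-tree acceleration the comment asks about; the example log lines are invented):

```python
def lcs(a, b):
    """Longest common subsequence of two token lists (classic DP)."""
    dp = [[[] for _ in range(len(b) + 1)] for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if x == y:
                dp[i + 1][j + 1] = dp[i][j] + [x]
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j], key=len)
    return dp[-1][-1]

def merge(template, entry):
    """Keep template tokens that are in the LCS; replace the rest
    with '*', marking them as variable (parameter) positions."""
    common, out, k = lcs(template, entry), [], 0
    for tok in template:
        if k < len(common) and tok == common[k]:
            out.append(tok)
            k += 1
        else:
            out.append("*")
    return out

# Two entries from the same (hypothetical) print statement:
t = merge("Deleting block blk_1 file /a".split(),
          "Deleting block blk_2 file /b".split())
# t == ['Deleting', 'block', '*', 'file', '*']
```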

------
boltzmannbrain
Would be interested to see the results on a benchmark dataset for online
anomaly detection, comparing to those approaches used in practice:
[https://github.com/numenta/NAB#the-numenta-anomaly-benchmark-](https://github.com/numenta/NAB#the-numenta-anomaly-benchmark-)

------
ram_rar
Has anyone benefited from log anomaly detection in real production systems?
I have usually converted some of the important events in logs to metrics and
alerted users based on simple moving averages, spikes, etc. I have usually
started with alerts from system-level metrics and then checked the logs.
Applying anomaly detection to logs directly hasn't worked for me yet.
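For reference, the moving-average/spike baseline described above can be sketched like this (window size and threshold are arbitrary choices):

```python
def spike_alerts(series, window=5, k=3.0):
    """Flag points more than k trailing standard deviations above
    the trailing moving average."""
    alerts = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        mean = sum(hist) / window
        std = (sum((x - mean) ** 2 for x in hist) / window) ** 0.5
        if std > 0 and series[i] > mean + k * std:
            alerts.append(i)
    return alerts

# Hypothetical error-count-per-minute metric derived from logs:
counts = [3, 4, 3, 5, 4, 3, 4, 50, 4, 3]
alerts = spike_alerts(counts)  # flags the spike at index 7
```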

~~~
gesman
Oh yes.

Applying K-Means clustering across different features of online traffic always
shows some weird and often malicious stuff:

[https://imgur.com/a/qWMgUo0](https://imgur.com/a/qWMgUo0)
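A minimal version of that idea, with invented traffic features: fit k-means on traffic assumed to be mostly benign, then score each source by its distance to the nearest centroid.

```python
import random

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def mean(pts):
    return tuple(sum(xs) / len(pts) for xs in zip(*pts))

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means; returns the centroids."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: dist(p, centroids[j]))
            clusters[nearest].append(p)
        centroids = [mean(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids

# Hypothetical per-source features: (requests/min, avg response bytes).
normal = [(10 + i % 3, 100 + i % 5) for i in range(20)] \
       + [(50 + i % 3, 20 + i % 5) for i in range(20)]
centroids = kmeans(normal, k=2)

def anomaly_score(p):  # distance to the nearest centroid
    return min(dist(p, c) for c in centroids)

outlier = (500, 900)  # e.g. a scraper or credential-stuffing source
```

Points far from every centroid are the "weird stuff" that shows up for review.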

~~~
slv77
Care to share more about what kinds of features you cluster on?

------
bhnmmhmd
I was wondering: has anyone here applied cluster-analysis techniques for
anomaly detection?

I read a paper that used it for insurance fraud detection, but I don't know
what other fields are using clustering to detect fraud and abnormalities.

I'd be grateful if someone could help.

~~~
gesman
Yes, tons of that.

See this - using K-Means clustering for anomaly detection in web traffic:

[https://imgur.com/a/qWMgUo0](https://imgur.com/a/qWMgUo0)

Using DBSCAN clustering for anomaly detection in healthcare claims data
(detecting doctors who anomalously prescribe opioids), on the public CMS data
set from 2015.

4 of the top 8 anomalies (doctors) were later actually convicted of crimes or
got into all sorts of trouble with the DOJ:

[https://imgur.com/a/6wFWTg5](https://imgur.com/a/6wFWTg5)

[https://imgur.com/a/f721ndb](https://imgur.com/a/f721ndb)

(Splunk Enterprise + free apps was used to ingest data and build all this
logic and dashboards)
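For anyone wanting to try the same idea without Splunk, here is a toy DBSCAN: points labeled as noise (-1) are the anomaly candidates. The (claims_per_month, opioid_share) features and all thresholds below are invented for illustration.

```python
def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; returns one label per point, -1 = noise."""
    labels = [None] * len(points)
    neighbors = lambda i: [j for j in range(len(points))
                           if dist(points[i], points[j]) <= eps]
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1           # not a core point: noise (for now)
            continue
        labels[i] = cluster
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbors(j)
            if len(jn) >= min_pts:   # expand only through core points
                queue.extend(jn)
        cluster += 1
    return labels

# Invented features: a dense mass of typical prescribers plus two
# far-out points.
typical = [(20 + i % 5, 0.1 + (i % 3) * 0.01) for i in range(30)]
outliers = [(200, 0.9), (180, 0.95)]
labels = dbscan(typical + outliers, eps=6.0, min_pts=4)
flagged = [p for p, l in zip(typical + outliers, labels) if l == -1]
```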

~~~
bhnmmhmd
Thank you so much, it really was helpful.

------
cphoover
Is there a github for DeepLog?

~~~
mino
I had contacted the first author in March and the answer was that "our source
code is currently not available because of a pending patent application".

~~~
cphoover
that's lame...

------
sscarduzio
Elastic.co X-Pack has machine learning for log anomalies and people buy and
use that stuff. Has anybody direct experience with that?

~~~
dimitry12
I don't, but I was researching the space, and
[https://www.anodot.com/](https://www.anodot.com/) has the most feature-rich
product - though they only discover anomalies in numeric time-series.

~~~
ygur
Check out www.loomsystems.com for a spot on AI log analysis

------
matachuan
Pure trash

~~~
dang
This breaks the HN guidelines, which ask you not to post shallow dismissals.
Better options would be either to factually explain what the problems are, so
that people can learn something, or not to post.

[https://news.ycombinator.com/newsguidelines.html](https://news.ycombinator.com/newsguidelines.html)

