
Applying machine learning to Infosec - adamnemecek
http://conf.startup.ml/blog/infosec
======
pilooch
Machine learning is pervasive in the new products targeted at cybersecurity
and infosec. From malware detection to anomaly detection across billions of
logs, it's been part of what my customers ask for, for at least four years
now. Most of the largest companies are already running and testing a range of
novel proprietary or custom in-house systems.

The main differences from straight 'supervised' machine learning are the lack
of labels and the unreal volume of data (cheaply produced by machines, in
machine time; we're talking microseconds to milliseconds here). So
unsupervised learning is king in this domain, and security operators often
have to keep an eye on and interpret the results. Another difference from
other fields is that datasets are rare, mostly because of privacy, as the logs
can be very revealing. For this reason, I believe the market has a lock-in
that not everyone can overcome: you basically need to be in place already.
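As a minimal sketch of what "unsupervised learning with an operator in the loop" can look like, here is an isolation-forest anomaly detector over made-up per-host log features. Everything here is illustrative (feature choices, contamination rate, the synthetic data), not from any real deployment:

```python
# Sketch: unsupervised anomaly detection over log-derived features,
# assuming scikit-learn is available. An operator still has to
# interpret whatever gets flagged.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Per-host features: [events/sec, distinct destination ports, MB out]
normal = rng.normal(loc=[50, 5, 10], scale=[10, 2, 3], size=(1000, 3))
anomalous = np.array([[400, 120, 900]])  # e.g. a scanning/exfiltrating host
X = np.vstack([normal, anomalous])

model = IsolationForest(contamination=0.01, random_state=0).fit(X)
scores = model.decision_function(X)  # lower score = more anomalous

# Hand the most anomalous hosts to an operator for interpretation
flagged = np.argsort(scores)[:5]
print(flagged)
```

The point is the workflow, not the model: no labels go in, and a human triages what comes out.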

Here is a recent, very high-level survey on the topic (a TC report, not that
bad for once):
[https://techcrunch.com/2016/07/01/exploiting-machine-learning-in-cybersecurity/](https://techcrunch.com/2016/07/01/exploiting-machine-learning-in-cybersecurity/)

Here are the two pieces of our own open-source tooling that we use the most in practice:

\- very efficient C++ map-reduce feature generation for logs (for ML and
analytics):
[https://github.com/soprasteria/cybersecurity-miw](https://github.com/soprasteria/cybersecurity-miw)

\- machine learning / deep learning server:
[https://github.com/beniz/deepdetect](https://github.com/beniz/deepdetect)

The ML cybersecurity + infosec field is still young but moving very fast,
with a lot of new startups and (somewhat opaque) products.

~~~
sp_
Great summary! By nature of my job (eng lead of a major mobile malware
detection team) I have a lot of startups pitch their ML solutions to me. A
couple of thoughts:

\- There are no publicly available datasets for training. There are a few
small ones and a few old ones, but they don't reflect the reality of 2016.
Companies that approach me pitching solutions to the malware of 2012 are not
useful.

\- The majority of mobile malware is based on some kind of social engineering.
On a code level these are indistinguishable from legitimate applications (the
same APIs are used in the same fashion). The only difference is whether app
behavior meets user expectations or not. Making this decision automatically
seems intractable so far.

\- Malware is not really a well-defined term. There is phishing, toll fraud,
Trojans, privilege escalation exploits, ... If you generically look for
malware, the signals you will look for are going to approach the complete set
of APIs made available by your OS. Your results will just be a giant blob
where everything is connected. Pick a single malware category at a time and
focus on just that. ML signals for priv esc will look _very_ different from
those for phishing.

\- ML is sexy. Malware analysis is not. Startups seem to hire too many ML
people and not enough malware analysis people. I've had startups pitch to me
that had literally zero people on staff who knew what mobile malware actually
looked like. They just did anomaly detection and then tossed the results over
to my team to verify. That's not how it works. We're not your QA
team. :)

~~~
pilooch
Hey, I just finished a custom malware-ML system for one of the largest
European corporations, large enough that some malware is targeted
specifically at them. The result is 97% accuracy (they retrained and checked
on their own held-out dataset). More careful analysis is needed (many malware
samples have high-entropy 'zones' that may help the classifier find the right
category), but overall it does work.

See the Microsoft / Kaggle challenge on classifying malware families; the
winning solution was > 99% accurate, IIRC.
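Those high-entropy 'zones' (packed or encrypted regions) can be turned into simple classifier features by computing Shannon entropy over fixed-size windows of a binary. A minimal sketch, with an arbitrary window size (not what any particular system uses):

```python
# Per-window byte entropy as a malware-classifier feature.
# Packed/encrypted regions approach 8 bits/byte; plain code and
# text sit well below that.
import math
from collections import Counter

def window_entropy(data: bytes) -> float:
    """Shannon entropy of a byte window, in bits per byte (0..8)."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_profile(blob: bytes, window: int = 256) -> list:
    """Entropy of each non-overlapping window; feeds a classifier."""
    return [window_entropy(blob[i:i + window])
            for i in range(0, len(blob) - window + 1, window)]

low = b"A" * 1024             # constant bytes: entropy ~0 bits/byte
high = bytes(range(256)) * 4  # uniform bytes: entropy 8 bits/byte
print(entropy_profile(low))
print(entropy_profile(high))
```

A real pipeline would use the whole profile (or statistics of it) as one feature group among many, not as a detector on its own.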

~~~
Eridrus
Can you describe a security setting where 97% accuracy is actually useful?
Unless the events you're looking at are low-volume, or you somehow have much
more malicious data than everyone else, that seems like a recipe for your
results being primarily FPs.

~~~
lmeyerov
For context, a company can easily get ~1B security-related events a day, so
even reporting, say, 0.1% of those wrong means some poor junior analyst has
1,000,000 tickets a day to slog through. If you expand that to full packet
captures as suggested in the article... ouch.
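Making that arithmetic concrete (the malicious-event base rate below is my own illustrative assumption):

```python
# Back-of-the-envelope: why 97% accuracy can still bury an analyst team.
events_per_day = 1_000_000_000      # ~1B security events/day, as above
error_rate = 0.001                  # report just 0.1% of events wrongly
print(events_per_day * error_rate)  # ~1e6 bad tickets per day

# Base-rate effect: with rare malicious events, even a good detector
# produces alerts that are almost entirely false positives.
malicious_rate = 1e-6   # assume 1 in a million events is actually bad
tpr = 0.97              # detection rate (sensitivity)
fpr = 0.03              # false positive rate

alerts_true = events_per_day * malicious_rate * tpr
alerts_false = events_per_day * (1 - malicious_rate) * fpr
print(alerts_false / (alerts_true + alerts_false))  # fraction of alerts that are FPs
```

With those numbers, well over 99.9% of alerts are false positives, which is why raw accuracy is close to meaningless at this volume.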

(We do some cool visual analytics work here, including unsupervised learning /
classification, and target more of the problem of "given an incident you're
already investigating, what else should you now look at from across all your
tools?")

------
astazangasta
As machines (computers) grow more sophisticated and start to resemble living
things in their complexity, we're going to have to start dramatically
improving their regenerative capacity and their ability to self-diagnose. Else
the future will be littered with sad, broken robots.

Training a machine implies some sort of evolutionary model (a training set
describes a fitness landscape). Maybe this will work (doubtful, across such a
large and variable surface), but how about thinking about this on a
fundamental design level? How does a computer know it is working properly?

~~~
digi_owl
> How does a computer know it is working properly?

That is the crux of the issue. Any command that potential malware may give
may also be given legitimately. What tells those two apart is context. And
context is a hard subject even for humans.

Even biology can't get it straight. After all, some of our most resilient
diseases exploit the normal signals of cells for their own purposes.

------
ThePhysicist
Capturing data flows and applying machine learning to them seems to be one of
the hottest topics in the infosec community right now, so I would not agree
that hardly anyone is doing this.

~~~
netman21
Agree. The original article seems like it was written ten years ago. In
infosec, the industry is usually a decade ahead of academia.

------
ntoshev
It's not just security - I'd expect cache eviction algorithms, schedulers,
constraint solvers and many others to also benefit from ML. Anything dealing
with hairy problems.

~~~
BrainInAJar
ZFS uses a sort of machine learning for cache eviction (
[https://en.wikipedia.org/wiki/Adaptive_replacement_cache](https://en.wikipedia.org/wiki/Adaptive_replacement_cache)
)
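For intuition, ARC's core trick is to balance a recency list against a frequency list, using "ghost" lists of recently evicted keys to decide which side deserves more space. Below is a heavily simplified toy of that adaptation mechanism; it is not ZFS's actual implementation (real ARC handles ghost hits and list sizing more carefully):

```python
# Toy sketch of ARC's adaptive idea: recency list (t1) vs frequency
# list (t2), with ghost lists (b1, b2) steering the target size p.
from collections import OrderedDict

class SimplifiedARC:
    def __init__(self, capacity):
        self.c = capacity
        self.p = capacity // 2   # adaptive target size for t1
        self.t1 = OrderedDict()  # keys seen once recently
        self.t2 = OrderedDict()  # keys seen at least twice
        self.b1 = OrderedDict()  # ghosts of keys evicted from t1
        self.b2 = OrderedDict()  # ghosts of keys evicted from t2

    def _evict(self):
        if self.t1 and len(self.t1) >= max(1, self.p):
            key, _ = self.t1.popitem(last=False)  # evict LRU of t1
            self.b1[key] = None
        else:
            key, _ = self.t2.popitem(last=False)  # evict LRU of t2
            self.b2[key] = None
        for ghost in (self.b1, self.b2):          # bound the ghost lists
            while len(ghost) > self.c:
                ghost.popitem(last=False)

    def get(self, key):
        if key in self.t1:                        # second hit: promote
            self.t2[key] = self.t1.pop(key)
            return self.t2[key]
        if key in self.t2:
            self.t2.move_to_end(key)
            return self.t2[key]
        return None

    def put(self, key, value):
        if key in self.t1 or key in self.t2:
            self.get(key)
            self.t2[key] = value
            return
        if key in self.b1:      # ghost hit: recency was undervalued
            self.p = min(self.c, self.p + 1)
            del self.b1[key]
        elif key in self.b2:    # ghost hit: frequency was undervalued
            self.p = max(0, self.p - 1)
            del self.b2[key]
        if len(self.t1) + len(self.t2) >= self.c:
            self._evict()
        self.t1[key] = value

def access(cache, key):
    if cache.get(key) is None:
        cache.put(key, key)

cache = SimplifiedARC(4)
for k in "abab":    # 'a' and 'b' become frequent, land in t2
    access(cache, k)
for k in "wxyz":    # a one-pass scan churns t1 only
    access(cache, k)
```

After the scan, 'a' and 'b' are still cached while the scanned-once keys were evicted; that scan resistance is what makes ARC attractive for filesystem caches, and plain LRU has no equivalent.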

------
code_research
also some interesting things here:
[http://www.mlsec.org/](http://www.mlsec.org/)

