More than that. Currently there are a lot of shady security suppliers who do nothing except imitate good protection. They put a fancy UI on top of a mediocre third-party anti-virus engine, and presto, you have yet another source of false positives. They do not react to reports or provide any meaningful support. Instead, they are pretty busy with marketing, buzz, and sales. Yes, what could go wrong.
It's also a space where data collection and ground truth are expensive, so people try to bootstrap solutions in odd ways. It's tough.
As the paper author, I'm not going to name any companies, especially since one was sharing data with us for this work, and I don't want to inject any of my biases into a particular name. But we've done some fairly extensive testing of the models we've built, both on data they gave us and data we gathered ourselves, and our model doesn't just label everything as malware. It is not quite AV quality at the moment, but we also have limited training data and are really still exploring the different ways one can tackle this problem with AI.
Jon did a great job in this blog post, but if you want more details on why there is still so much work to be done, I'd recommend reading our paper! We tried to make the intro section accessible to people with no AV/malware background.
I sure hope it wasn't Tehtris, because their "AI detection" is nothing short of random.
The problem is that to be a "good" program, yours has to do something other "good" programs do. And there aren't that many of those that use different software development paradigms (not different versions of the same one). So new different programs tend to be marked as malware.
This also suggests a very easy way to circumvent such AVs: simply modify an existing goodware program and it will still be marked as goodware. Add some obfuscation/polymorphism and it would be virtually impossible to detect such malware using static analysis/AI-based AVs.
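A toy illustration of how cheap that kind of tampering is (the bytes are a made-up stand-in, not a real binary): appending a payload changes the file's hash, but any static feature computed over the familiar prefix is untouched.

```python
import hashlib

good = b"MZ" + bytes(1000)       # stand-in for a known-good binary (not a real PE)
tampered = good + b"\x90" * 16   # appended bytes standing in for injected code

# Any hash-based signature of the original file no longer matches...
print(hashlib.sha256(good).hexdigest() != hashlib.sha256(tampered).hexdigest())  # True
# ...but static features over the original, "goodware-looking" prefix are identical.
print(good[:512] == tampered[:512])  # True
```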
Average size for malware is ~100kb-200kb btw. This is way smaller than almost any software besides some console games.
What happens in the wild is that one malware author releases a lot of very similar polymorphic or differently-compiled malware so it ends up being trivial to identify it. For example, they could have picked up a small icon that is common to half of the malware or some internal library that is used in a large portion of them. Then a week later the nature of the malware changes and you would identify a lot less.
Another thing to consider is that in many cases, a tiny modification to a known good program can make it malicious. This includes such things as changing the update URI. I don't see how they could catch such malware using this method so the 98% detection seems like a very unrealistic number.
Just to present an example:
One can train a simple logistic regression on some metadata features where the malware comes from one source and easily identify almost all of them correctly, while failing to identify malware from most other sources.
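That failure mode is easy to reproduce on synthetic data (all features and numbers below are made up): if one malware family happens to share a single metadata quirk, such as a reused icon, a plain logistic regression will learn the quirk rather than maliciousness, and miss malware from every other source.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_samples(n, shared_icon_rate):
    # Hypothetical metadata features: [has_shared_icon, size, import_count]
    icon = (rng.random(n) < shared_icon_rate).astype(float)
    size = rng.normal(0.5, 0.1, n)      # same distribution for every group,
    imports = rng.normal(0.3, 0.1, n)   # so the icon is the only real signal
    return np.column_stack([icon, size, imports])

X_benign   = make_samples(500, 0.01)
X_source_a = make_samples(500, 0.95)   # one family reusing a common icon
X_source_b = make_samples(500, 0.01)   # malware from other sources

X = np.vstack([X_benign, X_source_a])
y = np.concatenate([np.zeros(500), np.ones(500)])

# Plain logistic regression trained with gradient descent.
w, b = np.zeros(3), 0.0
for _ in range(3000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * (p - y).mean()

def detect_rate(Xs):
    return (1.0 / (1.0 + np.exp(-(Xs @ w + b))) > 0.5).mean()

acc_a = detect_rate(X_source_a)  # near-perfect on the source it trained on
acc_b = detect_rate(X_source_b)  # misses almost everything from other sources
print(acc_a, acc_b)
```

The model's only learned rule is effectively "shared icon = malware", which looks great on held-out samples from source A and falls apart everywhere else.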
Having said that, it's a pretty cool novel approach and I'd love to try it.
The dataset is small by AV standards, but we aren't an AV company. We can only use as much as real AV companies are willing to share with us. If you'd like to share more, we would be happy to take it :)
The model is fairly robust to new data, and we tested it with malware from a completely separate source than our training data - so there shouldn't be any shared items like icons between the training set and the second testing set. However, we aren't arguing it is of AV quality today. The main purpose of this research was to get a neural network to train on this kind of data at all, as it is non-trivial and common tools (like batch-norm) didn't translate to this problem space.
We are looking at the modification issue! I can't share any results yet since we have to go through pre-publication review, but the issue isn't unknown to us!
Our work so far has found that data quality used in training is the biggest factor in the performance you should expect. Which isn't surprising, but it seems to be a bigger problem in this space. Some of our first work was dealing with that issue and showing how critical it can be http://www.readcube.com/articles/10.1007/s11416-016-0283-1?a...
The reality is there is a big gray area between the two classes. Some cases are really hard to determine, and would be something that would lead to errors in production. Some examples:
What if a program is of malicious intent, but the author messed it up and it doesn't do anything? Is it still malware?
What if it's a program used for encryption for security, but used by malware to create ransomware? Is it malicious now?
What if it's a benign program, but a bug causes it to destroy files? Is it malicious?
Some programs are maybe not malicious, but just annoying (like browser toolbar installers). What do we call it? Some systems have a "Potentially Unwanted Software" category for these guys.
Ultimately, it's not easy. Thankfully most binaries are fairly cut-and-dry in terms of which side of the fence they belong on. The hope is that with enough labeled data, we can do a good job for the majority of cases. We don't expect it to ever be perfect. Handling the hard-to-distinguish samples is definitely something we'll dig into in the future.
Regardless, this seems like a promising technique, even with that potential caveat. Since most malware out in the wild isn't that sophisticated, this is likely quite effective.
Endgame has a great paper on this problem, and showed how they can defeat regular AVs with some fairly simple modifications that don't impact the malware's execution. https://www.blackhat.com/docs/us-17/thursday/us-17-Anderson-...
"I don't know, it always says that. Just click 'proceed anyway'"
From what we are seeing (as a desktop software vendor), all fancy-shmancy AI-based antiviruses absolutely "excel" at false positives. It's more of a miracle when they do NOT flag something that's not of the "hello world" variety as malware. And I wish I were kidding.
A lot of that issue comes from people using bad datasets. One of our first papers was about that ( http://www.readcube.com/articles/10.1007/s11416-016-0283-1?a...
), and showed that using the data most people use in their research, benign data collected from clean Microsoft installs, is not sufficient. The model will literally learn to look for the string "Copyright Microsoft Corporation" to decide if something is benign. Everything else ends up getting marked as malicious.
We are using better data in this work, and it does not suffer from this problem. It is not ready to be a real production AV, but it does a fairly good job at separating out benign vs malicious files and dealing with non-trivial examples of both.
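In toy form, the degenerate rule that kind of biased dataset teaches a model looks like this (the byte strings below are fabricated examples, not real files):

```python
# The "model" a clean-Microsoft-installs-only dataset converges to:
# benign means "looks like Microsoft", everything else is malware.
def shortcut_classifier(file_bytes: bytes) -> bool:
    """Return True if flagged as malware, i.e. the copyright string is absent."""
    return b"Copyright Microsoft Corporation" not in file_bytes

ms_installer   = b"...Copyright Microsoft Corporation..."
actual_malware = b"...no copyright string here..."
third_party    = b"...Copyright Example Software GmbH..."  # perfectly benign

print(shortcut_classifier(ms_installer))    # False: correctly passed as benign
print(shortcut_classifier(actual_malware))  # True: correctly flagged
print(shortcut_classifier(third_party))     # True: false positive on clean software
```

On the biased test set this rule scores nearly perfectly, which is exactly why the evaluation looked fine until non-Microsoft benign software showed up.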
Still haven't found a good way to deal with high bias or high variance myself.
At a more technical level, the approach we take in this paper (and most of my research) is fairly orthogonal to what most AV vendors are doing. Even compared to the AI based solutions.
The idea here was to throw away everything we know about the file being a valid Windows PE binary, and try to let the network learn what it needs on its own. It makes the problem harder, but allows us to re-purpose the same code for PDFs, Word docs, RTF - basically any file format we can get data for. This gives us a lot of potential flexibility that others don't have.
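This is not the network from the paper, but a minimal numpy sketch of the format-agnostic idea: treat any file as a raw byte sequence, look each byte up in a (here randomly initialized) embedding table, and pool over positions, with zero parsing of the container format.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED = rng.normal(size=(257, 8))   # 256 byte values + 1 padding id (toy, untrained)

def featurize(raw: bytes, max_len=4096):
    # Bytes -> integer ids, padded to a fixed length with the extra id 256.
    ids = np.frombuffer(raw[:max_len], dtype=np.uint8).astype(int)
    ids = np.pad(ids, (0, max_len - len(ids)), constant_values=256)
    # Embed every position and global-max-pool - no knowledge of the format.
    return EMBED[ids].max(axis=0)

# The exact same code path handles a PE-like prefix and a PDF-like prefix:
v1 = featurize(b"MZ\x90\x00" + b"\x00" * 100)
v2 = featurize(b"%PDF-1.7\n...")
print(v1.shape, v2.shape)  # (8,) (8,)
```

In the real setting the embedding and everything downstream would be learned end-to-end; the point of the sketch is only that nothing in it is PE-specific.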
It doesn't seem like you've looked into this. The interesting data in PDFs and Office docs is all encoded, often multiple times. E.g. OOXML docs are ZIP files and store macros in an OLE container, where they're further encoded in streams.
You can kind of get away with not parsing PE files, although you're missing out in that case. For PDFs, Office docs, and most other non-binary, non-script types, though, you have no choice but to parse.
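To make the container point concrete, here is a stdlib-only sketch using a fabricated stand-in for a macro-enabled OOXML file (real documents are more involved; the contents below are made up):

```python
import io
import zipfile

# Fabricate a minimal stand-in for a macro-enabled OOXML doc: the outer
# container is just a ZIP, and the macros sit in an OLE file inside it.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("[Content_Types].xml", "<Types/>")
    z.writestr("word/document.xml", "<document/>")
    z.writestr("word/vbaProject.bin",
               b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1" + b"...macro bytes...")

# A scanner that refuses to parse the container never sees the macro at all;
# it has to unzip first, then parse the OLE stream one layer down.
with zipfile.ZipFile(buf) as z:
    macro = z.read("word/vbaProject.bin")

print(macro[:4])  # the OLE magic, invisible from the raw outer bytes
```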
This paper looked specifically at PE files because it's the hardest case of any of the file types (in our opinion & experience), and it's the one we have the most data for. We've built models for many other file types with success using much less data (though we are always looking for more).
Without looking inside the stream, you can't know whether it's bad. The rest of the PDF is incidental and can be swapped out with no change to the attack.
Can your approach produce a model to detect these PDFs? Sure, by overfitting a small/homogeneous data set. Which, to be fair, is almost impossible not to do, because sourcing and curating data is the hardest part of security-related data science. But in the wild, your miss rates will skyrocket.
This will all make more sense if you ever deploy. Then you'll see issues even in your PE model, for example with installers, signed files, parasitics, generic packers, p-code, DLLs, drivers, on and on.
For the Neural Network type stuff we used in this work, I would recommend Michael Nielsen's awesome website as a starting point http://neuralnetworksanddeeplearning.com/ and keras https://keras.io/ is the easiest NN library to pick up and get something going with. Andrew Ng's mooc https://www.coursera.org/learn/machine-learning is also a good starting point for some slightly more general machine learning background.
If you want to avoid the math (not my personal recommendation), I would start there and just mess around. Build small things and start reading more as you get comfortable. It's definitely an area I would encourage learning in an iterative way: try to learn a small amount, try to apply it, repeat.
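To show how little code "something going" takes in keras, here's a complete toy example (the task and data are made up: classify whether a random 8-dim vector sums to more than 4):

```python
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(0)
X = rng.random((256, 8)).astype("float32")
y = (X.sum(axis=1) > 4.0).astype("float32")

# A tiny fully-connected binary classifier.
model = keras.Sequential([
    keras.Input(shape=(8,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=20, verbose=0)

preds = model.predict(X, verbose=0)
print(preds.shape)  # one probability per sample: (256, 1)
```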