As a desktop software vendor, I've gathered a lot of perspective on this topic. AI anti-virus engines tend to mark everything as malware. Literally everything, even the simplest well-behaved "Hello World" apps. It looks like a pure manifestation of greed and laziness to me: slap in a black box, pretend it does all the heavy lifting, and relax while taking all the orders you can get in this life.
More than that: there are currently a lot of shady security suppliers who do nothing except imitate good protection. They put a fancy UI on top of a mediocre third-party anti-virus engine, and presto, you have yet another source of false positives. They don't react to reports or provide any meaningful support. Instead, they are pretty busy with marketing, buzz and sales. Yes, what could go wrong.
I'd be cautious about making broad statements regarding how AI-based AVs tend to behave. It's a new space with a lot of competitors, of varying degrees of quality, with varying degrees of actual AI in use. Not all of them are even tackling exactly the same problem, so comparisons can get murky fast.
It's also a space where data collection and ground truth are expensive, so people try to bootstrap solutions in odd ways. It's tough.
As the paper author, I'm not going to name any companies, especially since one was sharing data with us for this work and I don't want to inject my biases into a particular name. But we've done some fairly extensive testing of the models we've built, both on data they gave us and data we collected ourselves, and our model doesn't just label everything as malware. It is not quite AV quality at the moment, but we also have limited training data and are really still exploring the different ways one can tackle this problem with AI.
Jon did a great job in this blog post, but if you want more details on why there is still so much work to be done, I'd recommend reading our paper! We tried to make the intro section accessible to people with no AV/malware background.
The reason simple hello world apps are marked as malware by AI-based engines is that small programs tend to be malware. There are probably close to no hello world programs in their dataset.
The problem is that to be a "good" program, yours has to do something other "good" programs do. And there aren't that many of those that use genuinely different software development paradigms (as opposed to different versions of the same one). So new, different programs tend to be marked as malware.
This also brings up a very easy way to circumvent such AVs. Simply modify an existing goodware program and it will be marked as goodware. Add some obfuscation/polymorphism and it would be virtually impossible to detect such malware using static analysis/AI-based AVs.
Yes, because they try to strip the binaries as much as possible so that the file size is smaller. If you compile a regular "hello_world.c" with gcc -O3, the size is 8 KB. You can definitely make malware that is way smaller and still does something simple, like changing a registry value to point at some URI.
Average size for malware is ~100-200 KB, by the way. That is way smaller than almost any software besides some console games.
Having worked on a machine learning-based AV for several years, I'd like to point out that the dataset choice here is extremely important and they seem to have a pretty small one considering the number of possible variations and the choice of model.
What happens in the wild is that one malware author releases a lot of very similar polymorphic or differently-compiled malware, so it ends up being trivial to identify. For example, the model could have picked up on a small icon that is common to half of the malware, or some internal library that is used in a large portion of it. Then a week later the nature of the malware changes and you identify a lot less.
Another thing to consider is that in many cases, a tiny modification to a known good program can make it malicious. This includes such things as changing the update URI. I don't see how they could catch such malware using this method so the 98% detection seems like a very unrealistic number.
Just to present an example:
One can train a simple logistic regression on some metadata features where the malware comes from one source and easily identify almost all of them correctly, while failing to identify malware from most other sources.
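To make that concrete, here's a toy sketch of that failure mode. The feature names and numbers are entirely made up (synthetic "metadata" like section count and file size), but it shows how a logistic regression can look great on malware from one source and miss malware from another:

    # Hypothetical illustration: a classifier "solves" one malware source by
    # keying on incidental metadata, then fails on malware from another source.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    def make_samples(n, section_count, file_kb_mean):
        # Two toy "metadata" features: PE section count and file size in KB.
        sections = np.full(n, section_count) + rng.integers(-1, 2, n)
        size_kb = rng.normal(file_kb_mean, 30, n)
        return np.column_stack([sections, size_kb])

    benign        = make_samples(500, section_count=5, file_kb_mean=900)
    malware_src_a = make_samples(500, section_count=3, file_kb_mean=150)  # one packer/source
    malware_src_b = make_samples(500, section_count=5, file_kb_mean=850)  # different source

    X = np.vstack([benign, malware_src_a])
    y = np.array([0] * 500 + [1] * 500)

    clf = LogisticRegression(max_iter=1000).fit(X, y)
    print("detection rate, source A malware:", clf.predict(malware_src_a).mean())  # ~1.0
    print("detection rate, source B malware:", clf.predict(malware_src_b).mean())  # ~0.0

The model isn't learning "malicious behavior" at all; it's learning what source A's build pipeline happens to look like.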
Having said that, it's a pretty cool novel approach and I'd love to try it.
The dataset is small by AV standards, but we aren't an AV company. We can only use as much as real AV companies are willing to share with us. If you'd like to share more, we would be happy to take it :)
The model is fairly robust to new data, and we tested it with malware from a completely separate source than our training data, so there shouldn't be any shared artifacts like icons between the training set and the second testing set. However, we aren't arguing that it is of AV quality today. The main purpose of this research was to get a neural network to train on this kind of data at all, as that is non-trivial and common techniques (like batch norm) didn't translate to this problem space.
We are looking at the modification issue! I can't share any results yet since we have to go through pre-publication review, but the issue isn't unknown to us!
VirusTotal has been owned by Google since 2012. https://virusscan.jotti.org/ is another one, even older. It says 2004, but I met that guy in 2002 or 2003 and back then he already had this up. The URL might've been different.
It occurred to me it could be a useful source. I never actually looked into it, so I can't be sure it'd be useful, but since you are aware of it, it's a good indication it's not.
Feel free to send an email if you have any questions when trying it! Since it is a static technique we don't expect it to become quite as good as what you could get with a dynamic approach, but we've been happy with our results thus far.
Our work so far has found that the quality of the training data is the biggest factor in the performance you should expect. Which isn't surprising, but it seems to be a bigger problem in this space. Some of our first work dealt with that issue and showed how critical it can be: http://www.readcube.com/articles/10.1007/s11416-016-0283-1?a...
Could someone explain something to me? The difference between malware and useful-ware can be just 1 negation instruction. e.g. you list all files in a folder recursively and delete the ones that end with ".tmp", or you delete the ones that DON'T end with that. How can anyone expect an antivirus to distinguish these?
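To spell out how small that textual difference is, here's a toy Python sketch (dry-run only; it just prints what it would remove) where the harmless cleaner and the destructive variant differ by a single flipped condition:

    # Toy illustration: one negation separates housekeeping from destruction.
    # Dry-run: it only prints what it *would* delete.
    from pathlib import Path

    def cleanup(root, invert=False):
        for path in Path(root).rglob("*"):
            if not path.is_file():
                continue
            is_tmp = path.suffix == ".tmp"
            # invert=False: target temp files (benign housekeeping)
            # invert=True:  target everything EXCEPT temp files (destructive)
            if is_tmp != invert:
                print("would delete:", path)

    cleanup(".")                 # benign variant
    # cleanup(".", invert=True)  # "malicious" variant: one flipped condition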
Really, the benign-vs-malicious question is an oversimplification. But that's what we have data for, and what most people focus on.
The reality is there is a big gray area between the two classes. Some cases are really hard to determine, and would be something that would lead to errors in production. Some examples:
What if it was written with malicious intent, but the author messed it up and it doesn't do anything? Is it still malware?
What if it's a legitimate encryption program, but malware uses it to create ransomware? Is it malicious now?
What if it's a benign program, but a bug causes it to destroy files? Is it malicious?
Some programs are maybe not malicious, but just annoying (like browser toolbar installers). What do we call it? Some systems have a "Potentially Unwanted Software" category for these guys.
Ultimately, it's not easy. Thankfully most binaries are fairly cut-and-dried in terms of which side of the fence they belong on. The hope is that with enough labeled data, we can do a good job for the majority of cases. We don't expect it to ever be perfect. Hitting the hard-to-distinguish samples is definitely something we'll dig into in the future.
It's almost like an inversion of the usual image classification problems, where you want to identify the broad strokes of the image while reducing sensitivity to the value of individual pixels. Instead, here, you want to ignore the overall shape of the program (which is likely to be benign) and focus on tiny details to pick up sneaky hostile behaviour.
You'd probably have a weight of some sort, so if the program wants to modify N files, it gets 0.23N 'suspicious points'. To figure out how many files the code wants to modify, you can use standard symbolic execution without actually performing the syscalls/WINAPI calls. Same with checking whether the deletion is on the main code path vs. behind some event trigger. There are no silver bullets, but you can certainly factor 'normal' behavior into your analysis.
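A minimal sketch of what such a weighted score could look like. The behavior names, weights, and threshold are invented for illustration; in practice the counts would come from symbolic execution or sandbox traces:

    # Hypothetical weighted scoring over behaviors recovered by analysis.
    # Weights and threshold are illustrative, not tuned values.
    SUSPICION_WEIGHTS = {
        "file_write":           0.23,  # per file the code may modify
        "file_delete":          0.80,
        "registry_write":       0.50,
        "behind_event_trigger": 1.50,  # behavior gated on an odd trigger is more suspicious
    }

    def suspicion_score(observed):
        """observed: dict mapping behavior name -> count from the analysis."""
        return sum(SUSPICION_WEIGHTS.get(name, 0.0) * count
                   for name, count in observed.items())

    trace = {"file_write": 3, "file_delete": 1, "behind_event_trigger": 1}
    score = suspicion_score(trace)
    print(score, "-> flag for review" if score > 2.0 else "-> looks normal")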
Sandbox the software and analyze the end result (file actions, system-level actions) of running it; the actual results of use don't lie. Until they do, that is: you have context-sensitive viruses that detect VMs and are generally designed not to trigger when being analyzed.
Malware that just does that is not going to be very useful to a blackhat, who generally has some purpose in mind besides just wanton destruction. They are going to want to control the computer, ransom the data, or spread to other machines.
This article left me thinking about the possibility of adversarially constructed malware. It seems like an adversarial network could modify existing malware to look like benign machine code from the classifier's perspective. This might result in an "arms race," similar to text spinners vs spam classifiers.
Regardless, this seems like a promising technique, even with that potential caveat. Since most malware out in the wild isn't that sophisticated, this is likely quite effective.
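For a sense of what a functionality-preserving perturbation even looks like, here is a toy sketch of about the simplest one: appending bytes after the end of a PE file (the overlay), which doesn't change how the program runs, and keeping whichever padding lowers the classifier's score. The score_bytes function is a throwaway stand-in, not any real model, and the file name is hypothetical:

    # Toy sketch: functionality-preserving change by appending overlay bytes.
    # score_bytes is a stand-in heuristic, NOT a real classifier.
    import random

    def score_bytes(data: bytes) -> float:
        # Placeholder "maliciousness" score: fraction of high-valued bytes.
        return sum(b > 0x7f for b in data) / len(data)

    def append_padding(pe_bytes: bytes, budget: int = 4096, tries: int = 50) -> bytes:
        best, best_score = pe_bytes, score_bytes(pe_bytes)
        for _ in range(tries):
            padding = bytes(random.getrandbits(7) for _ in range(budget))  # low-value bytes
            candidate = pe_bytes + padding   # overlay bytes: the file still runs unchanged
            s = score_bytes(candidate)
            if s < best_score:
                best, best_score = candidate, s
        return best

    with open("sample.exe", "rb") as f:      # hypothetical input file
        perturbed = append_padding(f.read())

Against a real model the search would be guided (e.g. by gradients) rather than random, but the constraint is the same: only changes that leave the binary executable are on the table.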
This is something we are looking at! It is a harder problem to create adversarial examples in the malware space, because you can't make arbitrary changes and have the code still work.
From what we are seeing (as a desktop software vendor), all the fancy-shmancy AI-based antiviruses absolutely "excel" at false positive detection. It's more of a miracle when they do NOT flag something that's not of the "hello world" variety as malware. And I wish I were kidding.
A lot of that issue comes from people using bad datasets. One of our first papers was about that ( http://www.readcube.com/articles/10.1007/s11416-016-0283-1?a...
), and showed that using the data most people use in their research, benign data collected from clean Microsoft installs, is not sufficient. The model will literally learn to look for the string "Copyright Microsoft Corporation" to decide if something is benign. Everything else ends up getting marked as malicious.
We are using better data in this work, and it does not suffer from this problem. It is not ready to be a real production AV, but it does a fairly good job at separating out benign vs malicious files and dealing with non-trivial examples of both.
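As a toy demonstration of the shortcut described above (made-up data, deliberately simplified to a single string feature), a model trained with benign samples drawn only from clean Microsoft installs can get away with checking for one vendor string:

    # Toy demonstration of the dataset shortcut: benign = clean-install Microsoft files.
    # Feature: do the file's printable strings contain the vendor copyright string?
    from sklearn.tree import DecisionTreeClassifier

    MARKER = "Copyright Microsoft Corporation"

    def features(samples):
        return [[int(any(MARKER in s for s in sample))] for sample in samples]

    benign_train  = [["...", MARKER, "kernel32.dll"]] * 200   # clean-install binaries
    malware_train = [["http://evil.example", "cmd.exe"]] * 200

    X = features(benign_train + malware_train)
    y = [0] * 200 + [1] * 200
    clf = DecisionTreeClassifier().fit(X, y)

    # Any third-party benign program without the marker gets called malicious:
    print(clf.predict(features([["MyInstaller v1.0", "libcurl.dll"]])))  # -> [1]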
Cross-validating the classifier/hyperparameters and a good scoring metric (Matthews correlation coefficient) go a long way. Since the classes are very imbalanced, an appropriate scoring metric is very important. Even more important: train with lots of high-quality data whenever possible. Anecdotally, many people seem to obsess over the particular classification algorithm while neglecting data quality. A classifier is only ever as good as its training set.
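A minimal sklearn sketch of that evaluation setup, using synthetic imbalanced data in place of real file features; the model and grid are arbitrary choices for illustration:

    # Minimal sketch: cross-validated model selection scored with MCC on imbalanced data.
    # Synthetic data; in practice X would be features extracted from files.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import make_scorer, matthews_corrcoef
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=5000, n_features=40,
                               weights=[0.95, 0.05], random_state=0)  # 95/5 imbalance

    mcc = make_scorer(matthews_corrcoef)
    search = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": [100, 300], "max_depth": [None, 20]},
        scoring=mcc,
        cv=5,
    )
    search.fit(X, y)
    print(search.best_params_, round(search.best_score_, 3))

With 95% of samples benign, plain accuracy would reward a model that flags nothing; MCC (or a similar balanced metric) doesn't.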
At a very high level, yes. But the same could be said for anybody in the AI-AV space.
At a more technical level, the approach we take in this paper (and most of my research) is fairly orthogonal to what most AV vendors are doing, even compared to the AI-based solutions.
The idea here was to throw away everything we know about the file being a valid Windows PE binary and let the network learn what it needs on its own. It makes the problem harder, but allows us to re-purpose the same code for PDFs, Word Docs, RTF - basically any file format we can get data for. This gives us a lot of potential flexibility that others don't have.
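For readers unfamiliar with what "raw bytes in, verdict out" can look like, here is a rough PyTorch sketch of a generic byte-embedding + gated-convolution classifier. It is in the spirit of this line of work, not the authors' exact network or hyperparameters, and the input file name is hypothetical:

    # Rough sketch of a raw-byte classifier: no PE parsing, the network sees bytes only.
    # Generic architecture for illustration; not the paper's exact model.
    import torch
    import torch.nn as nn

    class ByteClassifier(nn.Module):
        def __init__(self, embed_dim=8):
            super().__init__()
            self.embed = nn.Embedding(257, embed_dim, padding_idx=256)  # 256 = padding token
            self.conv = nn.Conv1d(embed_dim, 128, kernel_size=512, stride=512)
            self.gate = nn.Conv1d(embed_dim, 128, kernel_size=512, stride=512)
            self.fc = nn.Linear(128, 1)

        def forward(self, x):                        # x: LongTensor of bytes, (batch, length)
            e = self.embed(x).transpose(1, 2)        # (batch, embed_dim, length)
            h = torch.relu(self.conv(e)) * torch.sigmoid(self.gate(e))  # gated convolution
            h = torch.max(h, dim=2).values           # global max pool over positions
            return self.fc(h)                        # logit: malicious vs benign

    # A real pipeline would pad/truncate every file to a fixed length first.
    with open("sample.exe", "rb") as f:              # hypothetical input file
        data = list(f.read()[:2_000_000])
    x = torch.tensor([data], dtype=torch.long)
    print(torch.sigmoid(ByteClassifier()(x)))        # untrained, so the output is meaningless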
> allows us to re-purpose the same code for PDFs, Word Docs, RTF
It doesn't seem like you've looked into this. The interesting data in PDFs and Office docs is all encoded, often multiple times. E.g. OOXML docs are ZIP files and store macros in an OLE container, where they're further encoded in streams.
You can kind of get away with not parsing PE files, although you're missing out in that case. For PDFs, Office docs, and most other non-binary, non-script types, though, you have no choice but to parse.
We have looked into it; it's just not in this paper. It actually works better on other file formats. PDFs are really easy to do with even simpler techniques; no parsing is needed. Modern Office docs need to be unzipped first, but that's not complicated. Old Office 97 docs are also a common vector that doesn't need to be processed.
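For what it's worth, that unzip step is just standard ZIP handling; a rough Python sketch (the part names are assumptions about a typical .docm, not something from the paper):

    # Rough sketch: peek inside a modern Office document before scanning its contents.
    # OOXML files (.docx/.docm/.xlsm/...) are ZIP archives; VBA macros live in vbaProject.bin.
    import zipfile

    def extract_scannable_parts(path):
        parts = {}
        with zipfile.ZipFile(path) as z:
            for name in z.namelist():
                # e.g. word/vbaProject.bin holds the macros in a .docm
                if name.endswith("vbaProject.bin") or name.endswith(".xml"):
                    parts[name] = z.read(name)
        return parts

    for name, blob in extract_scannable_parts("suspicious.docm").items():  # hypothetical file
        print(name, len(blob), "bytes")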
This paper looked specifically at PE files because it's the hardest case of any of the file types (in our opinion & experience), and it's the one we have the most data for. We've built models for many other file types with success using much less data (though we are always looking for more).
I'm responding to your comment about parsing, which doesn't jibe with reality. It's not a question of data science, just of what's visible to the naïve byte-driven approach.
E.g. in PDFs malicious JavaScript might be buried in an XFA stream with /Type /EmbeddedFile and /Filter /FlateDecode -- a ZIP file, in other words. Nothing about this is suspicious; benign PDFs do it too.
Without looking inside the stream, you can't know whether it's bad. The rest of the PDF is incidental and can be swapped out with no change to the attack.
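To make "looking inside the stream" concrete: a rough sketch assuming you've already isolated the raw bytes between the stream/endstream keywords of a /Filter /FlateDecode object (no predictor handling, and the token check is just a placeholder):

    # Rough sketch: FlateDecode stream bodies are zlib-compressed, so the embedded
    # content only becomes visible after inflating the bytes pulled out of the PDF.
    import zlib

    def inflate_stream(raw_stream_bytes: bytes) -> bytes:
        # Assumes the raw body of a /Filter /FlateDecode stream object.
        return zlib.decompress(raw_stream_bytes)

    payload = inflate_stream(open("xfa_stream.bin", "rb").read())  # hypothetical dump
    if b"/JavaScript" in payload or b"eval(" in payload:
        print("stream contains script-like content worth a closer look")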
Can your approach produce a model to detect these PDFs? Sure, by overfitting a small/homogeneous data set. Which, to be fair, is almost impossible not to do, because sourcing and curating data is the hardest part of security-related data science. But in the wild, your miss rates will skyrocket.
This will all make more sense if you ever deploy. Then you'll see issues even in your PE model, for example with installers, signed files, parasitics, generic packers, p-code, DLLs, drivers, on and on.
As an engineer working on security applications, how do I learn more about the machine learning/deep learning discussed in this article? Is it enough to have a general understanding and use off the shelf tools/libraries, or I should study the nitty gritty math? I have a CS degree but haven't touched math in years.
If you want to avoid the math (not my personal recommendation), I would start with the off-the-shelf tools and just mess around. Build small things and start reading more as you get comfortable. It's definitely an area where I would encourage learning in an iterative way: try to learn a small amount, try to apply it, repeat.