Reasoning about the interpreted behaviour of a program as "malware" is a really fun problem. Detecting known malware, variants, work-alikes, known behaviours, and then intent and "badness," are layers of abstraction that can't be solved bottom-up.
Badness is an interpreted reflection of the effect of executed code, so this "missing interpreter" information means there is no pure functional definition or derivation of badness. The best possible solution is a risk calculation based on code reputation, with the error rate covered by some kind of event-driven insurance. (Where you can't have certainty, you can at least have compensation.)
IMO, malware analysis is an imperfect information game that we have spent a lot of time trying to solve as a perfect one, and you can see how AV products are running up against these limits now.
Most of my work with n-grams has been for indexing purposes, meaning you need all the n-grams, not just the most frequent ones.
So if I understand correctly... the application here is to take a large number of programs known to be infected with the same malware, and then run this to find the large chunks in common that will be more reliable as a malware signature in the future.
That's definitely cool. It kind of feels like a different technique from n-grams in the end, though, since presumably just one chunk indicates a true positive? Whereas with n-grams you're generally looking for a fuzzier statistical match?
I'm wondering what other applications this might have besides malware detection.
>So if I understand correctly... the application here is to take a large number of programs known to be infected with the same malware, and then run this to find the large chunks in common that will be more reliable as a malware signature in the future.
You basically got it! We find large chunks in benign files too. While just _one_ chunk is pretty indicative, performance is much better using multiple. That's where Logistic Regression (+ Lasso) comes in, to tell us how important each chunk is to making a decision.
Near term I think there are some NLP applications for this (though not out to 1024 grams!), and I'm hopeful extensions to this work will be useful to bioinformatics problems.
I'm also interested in seeing what craziness people come up with now that large n-grams are an option! Everything I had done/learned before put n>6 as an immediate "why even think about it, you can't compute them" bucket.
The reason is that grammar is intensely hierarchical, so the purely "linear" processing that n-grams do stops being useful beyond things like compound words or short sayings like "don't mind if I do!"
I'd be interested to hear if bioinformaticians are interested in this work.
There may also be some applications for this in NLP. I'm chatting with a professor who thought this could help speed up some old algorithms that weren't scalable before.
I deal with malware from an operational perspective. I do use ML-based proprietary solutions that are crazy good at picking up on bad things. As someone who doesn't understand a lot of the n-gram and ML specific language in your paper, am I correct in saying your work would basically allow developing better or more accurate true positive detections?
We think our tool will provide features that can make AV systems better and more accurate, yes (see Figure 7)! We tested that in the last section of our paper. Lots of future work in that direction is planned.
(Actually, how do you compose? Is the feature vector a union of all top-k features?)
For most of the work we actually don't use any fancy sketching. Just a GIANT table where we count hashes and ignore collisions, making it super fast to count. Then we just use Quick-Select to find the top-k counts out of the 2^31 buckets.
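To make that concrete, here's a minimal sketch of that first-phase counting, with a tiny table standing in for the 2^31 buckets (the function name and `heapq` stand-in for Quick-Select are mine, not the paper's):

```python
import heapq

def top_k_bucket_counts(data: bytes, n: int, num_buckets: int, k: int):
    """Hash every n-gram into a fixed bucket table, ignoring collisions,
    then return the k largest bucket counts."""
    counts = [0] * num_buckets  # the "GIANT table" (tiny here for demo)
    for i in range(len(data) - n + 1):
        gram = data[i:i + n]
        counts[hash(gram) % num_buckets] += 1  # collisions just merge
    # Quick-Select finds the k-th largest in linear time; heapq.nlargest
    # is a simpler stand-in that gives the same answer.
    return heapq.nlargest(k, counts)
```

With `data=b"abababab"` and `n=2` you'd see the "ab" and "ba" grams dominate the top-2 counts (4 and 3 occurrences, assuming no bucket collision).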
In the second phase we technically use the Space-Saving algorithm, but it's generally not needed - we could just use a hash-map and naively count (which we know is actually safe if the data is truly power-law distributed). The Space-Saving algo gives us some greater peace of mind though, since real data doesn't have to obey our theories :)
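For anyone unfamiliar with Space-Saving, here's a rough sketch of the core idea (my own illustrative code, not from the paper): keep a bounded set of counters, and when a new item arrives with the table full, evict the minimum counter and let the newcomer inherit its count, so counts are only ever over-estimated.

```python
def space_saving(stream, capacity: int) -> dict:
    """Approximate heavy-hitter counting with at most `capacity` counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < capacity:
            counters[item] = 1
        else:
            # Evict the smallest counter; the new item inherits its
            # count + 1, guaranteeing counts are never under-estimated.
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return counters
```

On power-law data the frequent items quickly grow far above the minimum counter, which is why naive counting with a hash-map would usually have worked too.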
Features are just bag-of-words style!
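So featurization is basically this (a hypothetical helper, assuming count-valued features; the paper may use binary presence instead): each file becomes one vector entry per selected top-k gram, which then feeds the Logistic Regression + Lasso step.

```python
def featurize(data: bytes, top_grams: list) -> list:
    """Bag-of-words style vector: one count per selected n-gram."""
    return [data.count(g) for g in top_grams]
```

E.g. `featurize(b"abcabc", [b"ab", b"cd"])` gives `[2, 0]`. Those vectors can go straight into an L1-regularized logistic regression to weight each chunk's importance.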
I would say that many times larger n-grams are tested in the early stages, give poor results, and then are dropped in subsequent tests.
The primary point of our paper is that now we can circle back and really test all these hypotheses about larger n-grams for malware! And even if you only want smaller 6-grams, we get a nice big speedup too :)