
KiloGrams: Large N-Grams for Malware Classification - adulau
https://arxiv.org/abs/1908.00200
======
motohagiography
The project I never got around to was extracting n-grams, making them nodes of
a graph, then testing for badness variants using cliques and isomorphisms.
Part of the reason I didn't move on it was that "graphiness" did not appear to
add information beyond what could be obtained with other, faster clustering
methods and more mature tools, and much smarter people than me were already
working on solving it.
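
For what it's worth, the shape I had in mind was roughly this (a throwaway
sketch, never built; networkx and the parameters here are just placeholders):

    # Sketch only: n-grams become nodes, an edge links grams that co-occur in
    # the same sample, and maximal cliques suggest chunks that travel together
    # across variants.
    from itertools import combinations
    import networkx as nx

    def ngram_cooccurrence_graph(samples, n=8):
        G = nx.Graph()
        for data in samples:  # each sample is a bytes object
            grams = {data[i:i + n] for i in range(len(data) - n + 1)}
            G.add_nodes_from(grams)
            G.add_edges_from(combinations(grams, 2))
        return G

    # cliques = list(nx.find_cliques(ngram_cooccurrence_graph(samples)))

Even in this toy form the per-sample edge count is quadratic, which hints at
why the faster clustering methods kept looking like the better bet.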

Reasoning about the interpreted behaviour of a program as "malware" is a
really fun problem. Detecting known malware, variants, work-alikes, and known
behaviours, and then intent and "badness", are layers of abstraction that
can't be solved bottom-up.

Badness is an interpreted reflection of the effect of executed code, so this
"missing interpreter" information means there is no pure functional definition
or derivation of badness. The best possible solution is a risk calculation
based on code reputation, with the error rate covered by some kind of
event-driven insurance. (Where you can't have certainty, you can at least have
compensation.)

IMO, malware analysis is an imperfect information game that we have spent a
lot of time trying to solve as a perfect one, and you can see how AV products
are running up against these limits now.

------
crazygringo
This is certainly clever... but it's not about indexing _all_ n-grams of a
large size, only the most-frequent _k_ n-grams.

Most of my work with n-grams has been for indexing purposes, meaning you need
_all_ the n-grams, not just the most frequent ones.

So if I understand correctly... the application here is to take a large number
of programs known to be infected with the same malware, and then run this to
find the _large_ chunks in common that will be more reliable as a malware
signature in the future.

That's definitely cool. It kind of feels like a different technique from
n-grams in the end, though, since presumably just _one_ chunk indicates a true
positive? Whereas with n-grams you're generally looking for a fuzzier
statistical match?

I'm wondering what other applications this might have besides malware
detection.

~~~
EdwardRaff
Thanks! Yeah, we "cheat" by restricting ourselves to the top-k and by making
distributional assumptions. For our use case, the low-frequency grams are
never used, so it makes sense.
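
To make "top-k" concrete, here's a toy sketch (this naive counting version is
just for illustration, not the algorithm from the paper, and the file names
and parameters are made up):

    from collections import Counter
    from pathlib import Path

    def top_k_ngrams(paths, n=8, k=1000):
        # Count each byte n-gram once per file (document frequency) and
        # keep the k most frequent overall.
        counts = Counter()
        for p in paths:
            data = Path(p).read_bytes()
            grams = {data[i:i + n] for i in range(len(data) - n + 1)}
            counts.update(grams)
        return [g for g, _ in counts.most_common(k)]

    # top_k = top_k_ngrams(["a.bin", "b.bin"], n=64, k=128)

The memory cost of that counter is exactly what blows up for big n and big
corpora, which is the problem the paper is attacking.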

>So if I understand correctly... the application here is to take a large
number of programs known to be infected with the same malware, and then run
this to find the large chunks in common that will be more reliable as a
malware signature in the future.

You basically got it! We also find large chunks in benign files too. While
just _one_ chunk is pretty indicative, performance is much better when using
multiple. That's where Logistic Regression (+ Lasso) comes in, to tell us how
important each chunk is to making a decision.
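
As a rough illustration of that last step (scikit-learn assumed, reusing the
toy top_k_ngrams helper above; this is the shape of the idea, not the pipeline
from the paper):

    import numpy as np
    from pathlib import Path
    from sklearn.linear_model import LogisticRegression

    def presence_features(paths, grams):
        # One row per file, one 0/1 column per top-k gram.
        X = np.zeros((len(paths), len(grams)), dtype=np.float32)
        for i, p in enumerate(paths):
            data = Path(p).read_bytes()
            for j, g in enumerate(grams):
                X[i, j] = 1.0 if g in data else 0.0
        return X

    # labels: 1 = malicious, 0 = benign (placeholder lists)
    # X = presence_features(train_paths, top_k)
    # clf = LogisticRegression(penalty="l1", solver="liblinear").fit(X, labels)
    # The nonzero entries of clf.coef_ are the grams carrying the decision.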

Near term I think there are some NLP applications for this (though not out to
1024 grams!), and I'm hopeful extensions to this work will be useful to
bioinformatics problems.

I'm also interested in seeing what craziness people come up with now that
large n-grams are an option! Everything I had done/learned before put n > 6 in
an immediate "why even think about it, you can't compute them" bucket.

~~~
crazygringo
I've done a lot of work with n-grams in NLP, and in my experience they're only
useful up to 4-6 words at a time, unless you're trying to index proverbs.

The reason is that grammar is intensely hierarchical, so the merely "linear"
processing that n-grams do stops being useful beyond things like compound
words or short sayings like "don't mind if I do!"

~~~
EdwardRaff
Oh, I totally agree with everything you've said! Some collaborators I'm
chatting with have more niche NLP applications where larger n would be
valuable, though. I don't want to go into too much detail yet since it's their
project idea, and I'm not sure of their comfort level with blasting it out to
the public yet :)

------
EdwardRaff
Paper author here, happy to answer questions! I'll try and pop in as I have
time in the day :)

~~~
badrabbit
Have you developed open source tooling I can test out?

I deal with malware from an operational perspective. I do use ML-based
proprietary solutions that are crazy good at picking up on bad things. As
someone who doesn't understand a lot of the n-gram and ML-specific language in
your paper, am I correct in saying your work would basically allow developing
better or more accurate true-positive detections?

~~~
EdwardRaff
I want to open source it, but the government owns the code and bureaucracy is
slow. We've talked about it, but it will move at its own pace.

We think our tool will provide features that can improve AV systems to be
better and more accurate, yes (see Figure 7)! We tested that in the last
section of our paper. Lots of future work in that direction is planned.

------
woliveirajr
>Larger values of n are not tested due to computational burden or the fear of
overfitting

I would say that many times larger n-grams are tested in the early stages,
give poor results, and are then dropped in subsequent tests.

~~~
EdwardRaff
Thats true in many cases! For our malware work and research, we've gotten a
lot of consistent push-back that n-grams needed to be larger. A lot of older
work (with much smaller datasets) said 15-20 grams were best, but that
couldn't be tested again with a modern large corpus. A common source of that
intuition was that x86 assembly is variable length, and a single instruction
could be up to 15 bytes long.

The primary point of our paper is that now we can circle back and really test
all these hypotheses about larger n-grams for malware! And even if you only
want smaller 6-grams, we get a nice big speedup too :)

