Hacker News new | past | comments | ask | show | jobs | submit login
KiloGrams: Large N-Grams for Malware Classification (arxiv.org)
51 points by adulau 15 days ago | hide | past | web | favorite | 14 comments

The project I never got around to was extracting n-grams, making them nodes of a graph, then testing for badness variants using cliques and isomorphisms. Part of the reason I didn't move on it was 'graphyness," did not appear to add information that could be obtained using other, faster, clustering methods with more mature tools, and much smarter people than me were working on solving it.

Reasoning about the interpreted behaviour of a program as "malware," is a really fun problem. Detecting known malware, variants, work-alikes, known behaviours, and then intent and "badness," are layers of abstraction that can't be solved bottom-up.

Badness is an interpreted reflection of the effect of executed code, so this "missing interpreter" information means there is no pure functional definition or derivation of badness. The best possible solution is a risk calculation based on code reputation with the error rate covered by some kind of event driven insurance. (Where you can't have certainty, you can at least have compensation.)

IMO, malware analysis is an imperfect information game that we have spent a lot of time trying to solve as a perfect one, and you can see how AV products are running up against these limits now.

This is certainly clever... but it's not about indexing all n-grams of a large size, only the most-frequent k n-grams.

Most of my work with n-grams has been for indexing purposes, meaning you need all the n-grams, not just the most frequent ones.

So if I understand correctly... the application here is to take a large number of programs known to be infected with the same malware, and then run this to find the large chunks in common that will be more reliable as a malware signature in the future.

That's definitely cool. It kind of feels like a different technique from n-grams in the end, though, since presumably just one chunk indicates a true positive? Whereas with n-grams you're generally looking for a fuzzier statistical match?

I'm wondering what other applications this might have besides malware detection.

Thanks! Yea, we "cheat" by restricting ourselves to the top-k & distributional assumptions. For our case, the low frequency grams are never used, so it makes sense.

>So if I understand correctly... the application here is to take a large number of programs known to be infected with the same malware, and then run this to find the large chunks in common that will be more reliable as a malware signature in the future.

You basically got it! We also find large chunks in benign files too. While just _one_ chunk is pretty indicative, performance is much better by using multiple. Thats where Logistic Regression (+ Lasso) comes in to tell us how important each chunk is to making a decision.

Near term I think there are some NLP applications for this (though not out to 1024 grams!), and I'm hopeful extensions to this work will be useful to bioinformatics problems.

I'm also interested in seeing what craziness people come up with now that large n-grams are an option! Everything I had done/learned before put n>6 as an immediate "why even think about it, you can't compute them" bucket.

I've done a lot of work with n-grams in NLP and in my experience it's only useful up to 4-6 words at a time, unless you're trying to index proverbs.

The reason being that grammar is intensely hierarchical, so that mere "linear" processing that n-grams do stops being useful beyond things like compound words or short sayings like "don't mind if I do!"

Oh, I totally agree with everything you've said! Some collaborators I'm chatting with have more niche NLP applications where larger n would be valuable though. I don't want to go into too much detail yet since it's their project idea, and not sure on their comfort level on blasting it out to the public yet :)

> I'm wondering what other applications this might have besides malware detection.

I'd be interested to hear if bioinformaticians are interested in this work.

Bioinformatics is one of the application areas we thought this might be useful! I've chatted with a few, and it appears the applications right _now_ might be more limited due to the need for exact matches. It's something we are thinking about for the future, and think it could be useful.

There may also be some applications for this in NLP. I'm chatting with a professor who thought this could help speed up some old algorithms that weren't scalable before.

Paper author here, happy to answer questions! I'll try and pop in as I have time in the day :)

Have you developed open source tooling I can test out?

I deal with malware from an operational perspective. I do use ML based proprietaty solutions that are crazy good at picking up on bad things. As someone who doesn't understand a lot of the n-gram and ML specific language in your paper,am I correct in sayig your work would basically allow developing better or more accurate true positive detections?

I want to open source it, but the government owns the code and bureaucracy is slow. We've talked about it, but it will move at its own pace.

We think our tool will provide features that can improve AV systems to be better and more accurate, yes (see Figure 7)! We tested that in the last section of our paperLots of future work in that direction is planned.

What’s your top-k algorithm? I’ve used count sketches and heaps (I read heavy-keeper strictly outperforms count/count-min). I’ve also been told some variant of a hierarchical count sketch provides composability, but I’m not sure what that is.

(Actually, how do you compose? Is it the feature vector a union of all top-k features?)

Top-k for which phase?

Most of the work we actually don't use any fancy sketching. Just a GIANT table where we count hashes, and ignore collisions, making it super fast to count. Then we just use Quick-Select to find the top-k counts out of 2^31 buckets.

Second phase we technically use the Space-Saving algorithm, but its generally not needed - we could just use a hash-map and naively count (which we know is actually safe is the data is truly power-law distributed). The Space-Saving algo gives us some greater peace-of-mind though since real data doesn't have to obey our theories :)

Features are just bag-of-words style!

>Larger values of n are not tested due to computational burden or the fear of overfitting

I would say that many times larger n-grams are tested in the early stage, give poor result, and then are dropped in subsequent tests.

Thats true in many cases! For our malware work and research, we've gotten a lot of consistent push-back that n-grams needed to be larger. A lot of older work (with much smaller datasets) said 15-20 grams were best, but that couldn't be tested again with a modern large corpus. A common source of that intuition was that x86 assembly is variable length, and a single instruction could be up to 15 bytes long.

The primary point of our paper is that now we can circle back and really test all these hypothesis about larger n-grams for malware! And even if you only want smaller 6-grams, we get a nice big speedup too :)

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact