
Utilizing NLP to Detect APT in DNS - philip1209
https://labs.opendns.com/2015/03/05/nlp-apt-dns/
======
jedisct1
The previous system at OpenDNS was using TRE agrep with a simple regex
matching names commonly used for phishing.
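
As a rough illustration of that kind of approximate matching (not the actual
OpenDNS pattern, which isn't given here), the sketch below computes the
smallest edit distance between a keyword and any substring of a domain name,
which is the idea behind agrep-style matching with k errors. The `WATCHED`
list, the `flag` helper and the one-error threshold are assumptions made up
for the example.

```python
def min_substring_distance(pattern: str, text: str) -> int:
    """Smallest edit distance between pattern and any substring of text."""
    # Row 0 is all zeros, so a match may start anywhere in text.
    prev = [0] * (len(text) + 1)
    for i, pc in enumerate(pattern, start=1):
        curr = [i] + [0] * len(text)
        for j, tc in enumerate(text, start=1):
            cost = 0 if pc == tc else 1
            curr[j] = min(
                prev[j] + 1,         # pattern character left unmatched
                curr[j - 1] + 1,     # extra character in text
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    # Taking the minimum of the last row lets the match end anywhere too.
    return min(prev)


# Hypothetical watch list; the real regex is not part of this thread.
WATCHED = ("paypal", "appleid", "signin")


def flag(domain: str, max_errors: int = 1) -> list[str]:
    """Return the watched keywords found in domain with few enough errors."""
    name = domain.lower()
    return [w for w in WATCHED
            if min_substring_distance(w, name) <= max_errors]
```

With these assumptions, `flag("paypa1-secure.example.com")` reports `paypal`.
Short keywords combined with a loose threshold will also hit plenty of
legitimate names.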

The output had a high number of false positives (SEO, actual payment pages),
but still spotted a decent number of true positives. There was then a script
trying to load the web page, and a simple classifier making a prediction based
on the content. Classic phishing pages tend to add some obfuscation to avoid
Google Safe Browsing and other phishing detection systems. As a result, these
pages have few HTML tags; the DOM is built using obfuscated JS. This is
easy to reliably distinguish from minified scripts on a benign page, based on
the presence of actual content besides JS in the same document.
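
A minimal sketch of that content check, using only Python's standard
`html.parser`. This is not the classifier described above, just one way the
"script but almost no markup or text" signal could be measured. The class
name, the `looks_js_built` helper and both thresholds are arbitrary
assumptions.

```python
from html.parser import HTMLParser


class ContentStats(HTMLParser):
    """Counts start tags, script characters and non-script text characters."""

    def __init__(self):
        super().__init__()
        self.tags = 0
        self.script_chars = 0
        self.text_chars = 0
        self._raw = None  # "script" or "style" while inside such an element

    def handle_starttag(self, tag, attrs):
        self.tags += 1
        if tag in ("script", "style"):
            self._raw = tag

    def handle_endtag(self, tag):
        if tag == self._raw:
            self._raw = None

    def handle_data(self, data):
        size = len(data.strip())
        if self._raw == "script":
            self.script_chars += size
        elif self._raw is None:  # stylesheet content is ignored
            self.text_chars += size


def looks_js_built(html: str, max_tags: int = 10, max_text: int = 50) -> bool:
    """True if the page has inline script but almost no markup or text."""
    stats = ContentStats()
    stats.feed(html)
    stats.close()
    return (stats.script_chars > 0
            and stats.tags <= max_tags
            and stats.text_chars <= max_text)
```

A benign page with minified scripts still carries headings, paragraphs and
links in the same document, so it fails the tag and text limits. Scripts
loaded through `src=` are not counted here.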

Even though it only worked with some specific yet common phishing pages, the
number of false positives with this system was very low.

The main problem is that many phishing pages are not hosted at the root path.
Sometimes you get the phishing page from /, or a redirection from
hxxp://phish/ to hxxp://phish/the/actual/path, but most of the time, knowing
the actual path is needed. And the DNS logs don't have this information.
Compromised web sites are also commonly used for phishing, and DNS doesn't
help much there either.

------
clwg
This seems closer to Levenshtein distance than NLP classification. You could
also get similar results from a soundex comparison against domain names
(substituting numbers for letters when appropriate and breaking up/combining
subdomains).
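
To make the soundex idea concrete, here is a small sketch: digits are mapped
back to look-alike letters, the name is broken into labels and
hyphen-separated pieces (plus all labels joined), and each candidate's
American Soundex code is compared with a brand list. The `LEET` map, the
`BRANDS` list and the `sounds_like` helper are assumptions for the example.

```python
# American Soundex consonant codes.
CODES = {c: d for d, letters in {
    "1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r",
}.items() for c in letters}

# Assumed digit-to-letter substitutions: 0->o, 1->l, 3->e, 4->a, 5->s, 7->t.
LEET = str.maketrans("013457", "oleast")

BRANDS = ("paypal", "google", "apple")  # hypothetical watch list


def soundex(word: str) -> str:
    """Four-character American Soundex code, or "" if word has no letters."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    first = letters[0]
    out = []
    last = CODES.get(first, "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate equal codes
        code = CODES.get(c, "")
        if code and code != last:
            out.append(code)
        last = code  # a vowel resets this, so a repeat after a vowel counts
    return (first.upper() + "".join(out) + "000")[:4]


def sounds_like(domain: str) -> set[str]:
    """Brands whose Soundex code matches some piece of the domain name."""
    labels = domain.lower().translate(LEET).split(".")
    candidates = set(labels) | {"".join(labels)}
    for label in labels:
        candidates.update(label.split("-"))
    return {b for b in BRANDS for c in candidates if soundex(c) == soundex(b)}
```

Under these assumptions `sounds_like("g00gle-login.example.net")` returns
`{"google"}`. Soundex is coarse, though: "people" and "paypal" both encode to
P140, so a name like people.example.com would be flagged as well.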

Even with categorization based on ASNs, there is still a high risk of false
positives, which would make this more of an analytical process than an
automated inline blocking approach.

