
Machine Learning and Link Spam: My Brush With Insanity - a5seo
http://www.seomoz.org/blog/machine-learning-and-link-spam-my-brush-with-insanity
======
tlarkworthy
Oh my yes! Machine learning is some of the hardest programming there is. You
only ever get an indirect measure of whether it is working correctly or not.
It's hard to debug. My algorithm gets it right 80% of the time; have I made an
implementation mistake? Who knows?

My general strategy is to invest in training set curation and evaluation. I
also use quick scatter plots to check that _I_ can separate the training
sets into classes easily. If it's not easy to do by eye, then the machine is
not magic and probably can't either. If I can't, then it's time to rethink the
representation.
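
Something like this quick check, with made-up Gaussian blobs standing in for
your real features:

    # Quick eyeball check: can *I* separate the classes in feature space?
    # X and y are stand-ins for your own feature matrix and labels.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (100, 2)),    # class 0 samples
                   rng.normal(3, 1, (100, 2))])   # class 1 samples
    y = np.array([0] * 100 + [1] * 100)

    for label, color in [(0, "tab:blue"), (1, "tab:red")]:
        pts = X[y == label]
        plt.scatter(pts[:, 0], pts[:, 1], c=color, alpha=0.5,
                    label=f"class {label}")
    plt.legend()
    plt.show()  # no visible clusters => the machine probably can't see them either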

The author correctly underlines the importance of the training set, but it is
equally critical to have the right representation (the features). If you
project your data into the right space then pretty much any ML algorithm will
be able to learn on it. I.e., it's more about what data you put in than about
the processor. k-means and decision trees FTW.

EDIT: Oh, and maybe very relevant is the importance of data normalization.
Naive Bayes classifiers require features to be conditionally independent. So
you have to PCA or ICA your data first (or both), otherwise features get
"counted" twice, e.g. every wedding-related word counting equally toward spam
categorization. PCA realizes which variables are highly correlated and
projects them onto a common measure of "weddingness". Very easy with
scikit-learn: use decomposition.PCA with whiten=True.
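
A minimal sketch of that (the correlated toy matrix is invented for
illustration):

    # Decorrelate features so a Naive Bayes model doesn't "count"
    # correlated words (e.g. wedding-related terms) multiple times.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(1)
    base = rng.normal(size=(500, 1))
    X = np.hstack([base + rng.normal(scale=0.1, size=(500, 1)),  # two highly
                   base + rng.normal(scale=0.1, size=(500, 1)),  # correlated cols
                   rng.normal(size=(500, 1))])                   # one independent

    pca = PCA(whiten=True)          # whiten=True rescales components to unit variance
    X_white = pca.fit_transform(X)
    print(np.round(np.cov(X_white, rowvar=False), 2))  # ~identity covariance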

~~~
zby
Agreed about features and Bayesian filters. Words are just not very good
features for filtering spam - but he could easily have fed all the numeric
data to the Bayesian filter by dividing the data ranges into buckets (like
very-many-links-per-word, or over-100-links-per-word).
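
Something like this, with invented thresholds; the bucket label then just
becomes another token for the filter:

    # Turn a numeric signal into a categorical token a word-based
    # Bayesian filter can consume. Thresholds are made up for illustration.
    def link_density_token(links: int, words: int) -> str:
        """Map a links-per-word ratio onto a coarse bucket token."""
        ratio = links / max(words, 1)
        if ratio >= 0.5:
            return "links:very-many-per-word"
        if ratio >= 0.1:
            return "links:many-per-word"
        if ratio >= 0.01:
            return "links:some-per-word"
        return "links:few-per-word"

    # The bucket token just joins the word features:
    tokens = ["cheap", "watches", link_density_token(links=120, words=200)]
    print(tokens)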

------
dvt
I've wanted to build this for a while; I think an SVM-based spam solution
could be amazing. Obviously, as the article mentions, when trying to
categorize spam, a purely Bayesian approach is not great -- and neither is an
ANN (although, with a large enough pool of hidden layers, it can get pretty
decent). I think the issue lies in the problem set: spam cannot be treated
like a linearly separable model.

There are papers[1][2] that outline possible benefits of SVM-based spam
filtering. Unfortunately, SVMs are still in their infancy and not many people
know how to implement and use them. I do think they are the future, however.

[1]
[http://trec.nist.gov/pubs/trec15/papers/hit.spam.final.final...](http://trec.nist.gov/pubs/trec15/papers/hit.spam.final.final.pdf)

[2]
[http://classes.soe.ucsc.edu/cmps290c/Spring12/lect/14/007886...](http://classes.soe.ucsc.edu/cmps290c/Spring12/lect/14/00788645-SVMspam.pdf)

~~~
Radim
Assuming you mean Support Vector Machines by "SVM", you may be idealizing
them a bit.

SVMs have been around for almost two decades now, which is an eternity in the
ML world, rather than infancy.

SVMs don't require the problem set to be linearly separable.

Please note that there's a myriad of robust, scalable SVM implementations --
SVMlight, HeroSVM, LIBSVM, liblinear... (the latter two also have wrappers in
scikit-learn, a Python library mentioned in the OP).
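
For example, here's a toy non-linearly-separable problem handled by the
LIBSVM-backed SVC in scikit-learn (RBF kernel, so no linear separability
required):

    # Concentric circles are not linearly separable; an RBF-kernel SVM
    # separates them in the induced feature space.
    from sklearn.datasets import make_circles
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = SVC(kernel="rbf", C=1.0, gamma="scale")  # non-linear kernel
    clf.fit(X_train, y_train)
    print("test accuracy:", clf.score(X_test, y_test))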

~~~
notimetorelax
Would only add that unless you're writing a PhD thesis on SVMs, don't write
your own implementation. As Radim wrote, there are quite a few to choose from.

------
ZeroCoin
The problem now is that (imho) 99% of the links posted on the internet are
spam.

Unless you have a baseline of "what was here first" and "exactly when every
website went live with which links", like Google does (they have been
indexing websites since the dawn of time as far as the internet and linking
are concerned; heck, there wasn't even backlink spamming prior to Google,
because Google was the first search engine to rank by number of
backlinks!), you're going to have a really tough time determining what is
spam and what isn't.

Which 1% do you decide to focus in on?

~~~
dchuk
Just because a link is added to content after the content already exists
doesn't immediately qualify it as spam. 99% of links being spam is a pretty
massive assertion; is that anecdotal or backed by any actual data?

~~~
ZeroCoin
"doesn't immediately qualify it as spam", no. That's correct.

It does, however, make it very difficult to gauge who is a legitimate linker
and who is not.

All I'm really saying is that if you're starting now, with all the years of
random linkspam backscatter, you're in for a rough ride.

~~~
TomAnthony
Even if you had a perfect history of when links appeared, you'd still be in
for a rough ride. Furthermore, the absence of that information doesn't
invalidate the author's approach (though having it might improve its
effectiveness).

As an aside, you can use Ahrefs.com to get pretty decent tracking of when
links appeared since it started (I think ~18 months ago or so). Given that the
rate of spammy pages is increasing extremely fast and old spam pages are dying
off, I imagine that in the not-too-distant future you'll be able to get a
decent link history for many sites.

------
btw0
I built an anti-spam system for Delicious.com using a Naive Bayes classifier
with a really huge feature database (think tens of millions of features,
mostly tokens from different parts of the page). The features are given
different weights, which contribute to the final probability aggregation. The
result was similar to what the OP achieved: around 80% accuracy. It was a
really interesting and satisfying piece of work.
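
To give a flavor of what per-feature weighting can look like (not the actual
Delicious code; all numbers below are invented):

    import math

    # log-likelihood ratios log(P(token|spam) / P(token|ham)), learned
    # from labeled data in practice; values invented here.
    llr = {"viagra": 3.2, "meeting": -1.5, "free": 1.1}

    # per-zone weights: a token in the <title> counts more than one in the body
    zone_weight = {"title": 2.0, "anchor": 1.5, "body": 1.0}

    def spam_score(tokens, prior_log_odds=0.0):
        """Weighted Naive-Bayes-style aggregation into a spam probability."""
        log_odds = prior_log_odds
        for zone, token in tokens:
            log_odds += zone_weight.get(zone, 1.0) * llr.get(token, 0.0)
        return 1.0 / (1.0 + math.exp(-log_odds))  # logistic squash to [0, 1]

    print(spam_score([("title", "viagra"), ("body", "free"), ("body", "meeting")]))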

~~~
JacobiX
Hmm, interesting... but how do you calculate the weights? Do you use the
KL-divergence method?

------
a_p
I'm surprised this post doesn't mention Markov chains. The author seems to
think that finding and implementing a grammar quality checker will help stop
spam. Aside from providing endless hours of entertainment (viz. Dissociated
Press), Markov chains are abused by spammers to generate grammatically
plausible nonsense. You can easily add meaning to the "nonsense" by adding
formatting to certain words to carry a secondary message. Does anyone know of
a way to stop this?
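
For anyone who hasn't seen the trick, a word-level Markov chain is only a few
lines:

    import random
    from collections import defaultdict

    def build_chain(text):
        """Map each word to the words observed to follow it."""
        words = text.split()
        chain = defaultdict(list)
        for a, b in zip(words, words[1:]):
            chain[a].append(b)
        return chain

    def babble(chain, start, n=15):
        """Random-walk the chain: locally grammatical, globally meaningless."""
        out = [start]
        for _ in range(n):
            followers = chain.get(out[-1])
            if not followers:
                break
            out.append(random.choice(followers))
        return " ".join(out)

    corpus = "the spam filter reads the page and the page reads like prose"
    print(babble(build_chain(corpus), "the"))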

~~~
law
You can estimate the likelihood that a particular sentence is spam by summing
the log probabilities of the n-grams it contains. These probabilities are
obtained from a sufficiently general training set, such as the Google Books
Ngram datasets[1]. Using a trigram language model (n = 3), you could estimate
the likelihood as follows:

Sentence = "This sentence is semantically and syntactically valid."

log P(Sentence) = log p(This | START, START) + log p(sentence | START, This) +
log p(is | This, sentence) + log p(semantically | sentence, is) +
log p(and | is, semantically) + log p(syntactically | semantically, and) +
log p(valid | and, syntactically) + log p(. | syntactically, valid) +
log p(STOP | valid, .) + log p(STOP | ., STOP)

where START and STOP are special symbols that aid in determining the proximity
of a word to the beginning and end of a sentence.
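
Here's a toy version of that scoring in Python, with an invented probability
table standing in for real n-gram counts (in practice you'd estimate them
from something like the dataset in [1]):

    import math

    START, STOP = "<s>", "</s>"

    # trigram probabilities p(w3 | w1, w2); values invented for illustration
    trigram_p = {
        (START, START, "this"): 0.1,
        (START, "this", "sentence"): 0.05,
        ("this", "sentence", "is"): 0.2,
        ("sentence", "is", "valid"): 0.01,
        ("is", "valid", STOP): 0.3,
        ("valid", STOP, STOP): 1.0,
    }

    def sentence_log_prob(words, floor=1e-8):
        """Sum log p(w_i | w_{i-2}, w_{i-1}) over the padded sentence."""
        padded = [START, START] + [w.lower() for w in words] + [STOP, STOP]
        total = 0.0
        for trigram in zip(padded, padded[1:], padded[2:]):
            total += math.log(trigram_p.get(trigram, floor))  # floor = crude smoothing
        return total

    print(sentence_log_prob("This sentence is valid".split()))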

If your training set fails to generalize sufficiently, you could use Bayesian
inference to estimate the likelihood that the sentence is spam. Under this
framework, you'd be calculating the posterior probability that the sentence
is spam given the observed sequence of n-grams, which combines (i) the prior
probability that any message is spam with (ii) the likelihood of the observed
n-gram sequence under the spam model versus the non-spam model.

[1]
[http://storage.googleapis.com/books/ngrams/books/datasetsv2....](http://storage.googleapis.com/books/ngrams/books/datasetsv2.html)

~~~
drakaal
Your comment would be marked as spam using your logic. Was that intentional?

------
drakaal
This is the really hard way.

And it is going to fail A LOT.

Do this instead:

1\. Contact a company that has a search engine and therefore access to all
your links (<http://samuru.com> springs to mind).

2\. Do keyword extraction on those pages. Assume that any linking page that
doesn't contain any of the keywords of the page being linked to is a Bad link
(sketched below).

3\. For the ones that remain, Google the keywords you extracted (like 10 of
the words); if the linking page doesn't appear in the top 50 results, it is
probably a Bad Neighbor according to Google.

This method doesn't require NLTK or grammar checking. You can do it yourself,
and you are using Google to tell you if the site is on the Bad Neighbor list,
so you don't have to guess.
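
A rough sketch of step 2 (crude term-frequency keywords; a real extraction
pipeline would be smarter than this):

    import re
    from collections import Counter

    STOPWORDS = {"the", "and", "a", "of", "to", "in", "is", "for", "on", "that"}

    def keywords(text, k=10):
        """Crude keyword extraction: the k most frequent non-stopword terms."""
        tokens = re.findall(r"[a-z']+", text.lower())
        counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
        return {word for word, _ in counts.most_common(k)}

    def looks_like_bad_link(linked_page_text, linking_page_text):
        """Step 2: no keyword overlap at all => flag the link as suspect."""
        return not (keywords(linked_page_text) & keywords(linking_page_text))

    print(looks_like_bad_link(
        "machine learning link spam classifier training set features",
        "cheap replica watches best prices free shipping worldwide"))  # True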

~~~
TomAnthony
Your approach is going to have a lot of problems.

One of the most linked-to pages on the internet is the download page for
Adobe Reader. It is definitely not spam, but millions of those links aren't
going to have "the keyword" on the page, so by your logic they are bad links.
This is an extreme example, but it is not an uncommon scenario.

Furthermore, if you have millions of backlinks, it becomes quite difficult to
scrape Google (though you can use services like Authority Labs).

~~~
drakaal
Why are you doing Bad Neighbor link checks if you make something like Adobe
Reader?

You don't have to scrape Google; they have a Search API that costs about $10
per 1,000 calls at volume.

I have done this for BILLIONS of links.

~~~
TomAnthony
You think Adobe never built any bad links? I know of many such large
companies that are spending hundreds of thousands of dollars a year or more
buying links.

Do you have a link to the API, please? Thanks!

~~~
drakaal
Custom Search. Don't specify any rules that would interact with the results
you are testing against (like adding a rule that favors a parked domain).

If they are buying those links, then finding the bad ones is as easy as
contacting the people they cut checks to.

------
ZirconCode
I tried doing something similar with AI the other day. My approach was to
look at money flow instead, since in theory spammers only spam to make money.
I basically downloaded an ad-blocker list and ran it against a page's source.
That, along with a couple of other factors, was fed into many attempts at
machine-learning fun. In the end, it all failed. I learned that it's just
impossible without a dataset like Google's, so I went and built them into the
process, and voila, it worked.
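
The ad-blocker part might look roughly like this (the list below is a tiny
invented stand-in for a real filter list like EasyList, which has its own
syntax):

    import re

    # Invented stand-in entries, treated as plain domain substrings.
    AD_DOMAINS = ["doubleclick.net", "googlesyndication.com", "adbrite.com"]

    def count_ad_references(page_html: str) -> int:
        """How many known ad-network domains appear in the page source?"""
        html = page_html.lower()
        return sum(len(re.findall(re.escape(d), html)) for d in AD_DOMAINS)

    html = '<script src="http://pagead2.googlesyndication.com/pagead/show_ads.js">'
    print(count_ad_references(html))  # 1 -- one feature among several for the model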

~~~
notimetorelax
> I went and built them into the process

Sorry, what do you mean by that?

~~~
ZirconCode
My script basically looked up how highly a link ranked on Google for various
related keywords. Alexa page rankings and similar services were included as
well. The AI then weighed the factors and tried to come up with an educated
guess. After I included external numbers from people with big databases, it
actually became very successful.

~~~
visarga
I tried doing this with reddit links. I downloaded a dataset of 100K
submissions and their vote ranks. I split them in two: those with fewer than
5 votes and those with more. That threshold happens to fall right at the
median.

Then I collected the HTML from those pages and turned it into a feature
vector, then tried to learn whether a page would get fewer or more than 5
votes. My prediction rate was as good as random. Fail.

Google has done this since Panda. They machine-predict whether a page will be
a success with users and use that as a ranking factor. It makes SEO into a
holistic art: you need to think of everything.

------
JacobiX
I have used maximum entropy classification for a quite similar task. It
achieves better performance than Naive Bayes classifiers. But as the author
remarked, the quality of the training set and the selection of features are
very important aspects too.
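
For anyone who wants to try it: maximum entropy classification is the same
model as multinomial logistic regression, so scikit-learn's LogisticRegression
over bag-of-words features is one easy way in (the training data below is a
toy stand-in for a real labeled corpus):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts = ["buy cheap links now", "great article thanks",
             "free seo backlinks", "interesting analysis of the data"]
    labels = [1, 0, 1, 0]  # 1 = spam

    # maxent == logistic regression over bag-of-words features
    clf = make_pipeline(CountVectorizer(), LogisticRegression())
    clf.fit(texts, labels)
    print(clf.predict(["cheap backlinks for sale"]))  # likely [1]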

------
zaptheimpaler
The Bayesian filter might have worked. Theres no reason to use only content as
a feature - you can use all the features you want regardless of which ML
technique you apply. Bayesian poisoning is a real concern though.

