

How To Build a Naive Bayes Classifier - zhiping
http://bionicspirit.com/blog/2012/02/09/howto-build-naive-bayes-classifier.html

======
jgrahamc
Years ago I wrote about this for Dr. Dobbs. That article is here:
[http://drdobbs.com/article/print?articleId=184406064&sit...](http://drdobbs.com/article/print?articleId=184406064&siteSectionName=)
The important differences are that the DDJ article used log probabilities
instead of simple probabilities because underflow is a real problem.

The other thing is that simple thresholds aren't the only solution to using
the output of naive Bayes for determining whether a message is ham or spam.
Back in 2007 I looked at calibrating the output to deal especially with the
problem of messages that have a probability near 0.5:
[http://blog.jgc.org/2007/03/calibrating-machine-learning-bas...](http://blog.jgc.org/2007/03/calibrating-machine-learning-based-spam.html)

Also for spam filtering it's worth testing whether stop word removal and
stemming are actually worthwhile: [http://blog.jgc.org/2007/03/so-how-much-difference-do-stopwo...](http://blog.jgc.org/2007/03/so-how-much-difference-do-stopwords.html)

~~~
bad_user
You are right about log probabilities ... I added a note about it (I'm the
author of the article).

Also on spam filtering, there's a note there about stemming not being optimal
for spam, but the spam classifier was just an example, otherwise stemming is
still useful when bootstrapping with a small dataset.

~~~
robertskmiles
I found that note very interesting:

> consider "house" and "housing" ... the former is used less frequently in a
> spammy context than the latter.

The point is that stemming is valuable unless P(spam|root) is substantially
different from P(spam|derivedWord). But that data is available to you through
Bayes. Has anyone tried a 'conditional stemming' system where the word is
stemmed iff

|P(spam|root) - P(spam|derivedWord)| < someThreshold

? I wonder if that would improve classification accuracy. I suppose it might
be a big performance hit on the stemming algorithm though.
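A minimal sketch of this idea, with a toy stem table and made-up P(spam|word) estimates standing in for a real stemmer and corpus-derived values:

```python
# 'Conditional stemming' sketch: stem a word only when P(spam|root) and
# P(spam|derivedWord) are close enough that merging their counts loses
# little information. The stem table and probabilities below are
# hypothetical stand-ins for a real stemmer and corpus statistics.

stems = {"housing": "house", "houses": "house", "cars": "car"}

p_spam = {"house": 0.20, "housing": 0.90, "car": 0.30, "cars": 0.32}

def conditional_stem(word, threshold=0.1):
    root = stems.get(word, word)
    if root == word or root not in p_spam or word not in p_spam:
        return root  # no evidence either way: fall back to plain stemming
    if abs(p_spam[root] - p_spam[word]) < threshold:
        return root  # distributions agree, so merging the counts is safe
    return word      # they diverge (the "housing" case): keep the word

print(conditional_stem("cars"))     # "car": the probabilities agree
print(conditional_stem("housing"))  # "housing": 0.9 vs 0.2, keep intact
```

The extra lookup per token is cheap next to the stemming itself; the real cost is needing per-word probabilities for both forms before you can decide.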

~~~
sigil
Interesting idea. I bet within some specific application domain, stemmed
probabilities either diverge or don't across the board, so it wouldn't pay to
conditionally stem term-by-term.

~~~
robertskmiles
Yeah, probably. That's something that's fairly easy to measure for any given
data set, so you can check to see if it's worthwhile before you do it.

------
juanre
Nice article, but I suspect the implementation will not work. I did
essentially the same thing for an AI-class exercise, and was thrilled to see
that you could write a working Bayes classifier in 60 short lines of Python.
But later I landed a freelance job that required writing a classifier that
could be applied to real-world data, and I soon realized that repeated
multiplication of numbers between 0 and 1 sends you to zero too fast for the
implementation to actually work. I might have missed it in the code, but I
think he's making the same mistake: you need to normalize or move to
logarithms for the estimation of probabilities to work on medium or large
datasets.
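A quick demonstration of the underflow and the log-domain fix (the per-word probability here is arbitrary):

```python
import math

# 200 words with per-word probability 0.01 gives a product of 1e-400,
# far below the smallest positive double (~1e-308): the running
# product silently underflows to exactly 0.0.
probs = [0.01] * 200

product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0

# The equivalent sum of logs stays comfortably representable, and since
# log is monotonic, comparing log-scores across classes preserves the
# same ordering as comparing the raw products would.
log_score = sum(math.log(p) for p in probs)
print(log_score)  # about -921.03
```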

~~~
bad_user
I'm the author of the article ...

Yes, you are right: it's better to convert the whole thing to a sum of logs,
otherwise you end up with floating-point underflow.

The article was already getting too long, but I'll add a note about it
because it is an important optimization that affects both speed and
correctness (given the underlying limitations of floating point).

UPDATE: note added.

~~~
ppod
Is the sum of logs method mathematically equivalent to the multiplication of
probabilities (i.e. will it always produce a monotonic ordering of class
predictions)?

~~~
alexchamberlain
log(AB) = log(A) + log(B) for all A, B real numbers

~~~
phaedon
Only for A, B > 0, but that's good enough for probabilities (except 0, I
guess)

~~~
alexchamberlain
That was there to test you...

------
sigil
We wrote a Naive Bayes Classifier for <http://linktamer.com/> that learns what
news articles you find interesting. It's in C and uses a cdb for the database
of frequencies, so it's pretty darn fast. Maybe we'll throw it up on github
one of these days.

Some resources and other reference implementations that were useful in
building it:

<http://www.gigamonkeys.com/book/practical-a-spam-filter.html> - Seibel's
"Practical Common Lisp" book has a very readable implementation

<http://www.paulgraham.com/spam.html> - pg's essay that revived interest in
using NBC for spam classification

[http://spambayes.svn.sourceforge.net/viewvc/spambayes/trunk/...](http://spambayes.svn.sourceforge.net/viewvc/spambayes/trunk/spambayes/spambayes/classifier.py?revision=3270&view=markup)
- a Python implementation that's been out there in the real world for about
10 years

[https://github.com/WinnowTag/winnow/blob/master/src/classifi...](https://github.com/WinnowTag/winnow/blob/master/src/classifier.c)
- a C implementation used in another news classifier

~~~
gnosis
_"We wrote a Naive Bayes Classifier for <http://linktamer.com/> that learns
what news articles you find interesting."_

I did something similar using dbacl.[1]

Unfortunately, I never got around to releasing the code. However, it worked
alright. The main problem was in the hassle it took to continuously train and
retrain the classifier as my interests changed.

This is the main difference between articles you actually want to read and
spam: spam is pretty much always going to remain spam. Your interests are
never really going to change to the point that something you thought was spam
one day would no longer be spam the next. But with articles, what you might
have found uninteresting one day may become interesting the next. And what you
thought was fascinating one day may become boring the next.

Thus, the need for continual training and re-training for spam is much less
than it is for news articles and the like.

Because of this, the classification/reclassification process for news articles
should be made as simple, quick and painless as possible. Otherwise it'll just
be too much trouble to keep doing it day after day, week after week, month
after month, and year after year.

[1] - <http://www.lbreyer.com/dbacl.html>

~~~
sigil
_But with articles, what you might have found uninteresting one day may become
interesting the next. And what you thought was fascinating one day may become
boring the next._

True. This is a real problem. Much better results can be obtained by
partitioning the space of articles and predicting on more focused subsets. The
partition can either be manual ("these documents are from my RSS feeds about
Python programming") or automatic, via some clustering algorithm.

Ideally you identify subsets where your interests are less transient.

But the flipside of using a general purpose classifier is that you can train
it to predict whatever you want. You're in the driver's seat. It doesn't have
to be "interesting," it can be "most relevant to my primary area of research"
or "most relevant to this book I'm writing" or "most relevant to my small
business."

Communicating this capability to users is whole 'nother deal of course.

 _I did something similar using dbacl._

I remember looking at dbacl. There's some great information in the writeups,
but I was disappointed that it coupled feature extraction to classification.
For instance, what if you wanted to use stemmed terms as features? How good is
the word tokenization for other languages? (compare to Xapian's tokenizer) Can
it do named entity identification and use those as features? Can you toss in
other features derived from external metadata? Etc.

We ended up doing all of these things, but our classifier itself remained a
simple unix tool. For training you piped in (item,feature,category) tsv
records. For classification, you piped in (item,feature) and it output
(item,category,probability).
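That interface might be sketched roughly like this; the field layout follows the description above, while the scoring, smoothing constant, and sample records are my own assumptions, not the real tool:

```python
import math
from collections import defaultdict

# Sketch of the TSV pipe interface described above: train from
# (item, feature, category) records, then classify bags of features.
# The Laplace smoothing and sample data are illustrative.

def train(lines):
    counts = defaultdict(lambda: defaultdict(int))  # category -> feature -> n
    for line in lines:
        _item, feature, category = line.rstrip("\n").split("\t")
        counts[category][feature] += 1
    return counts

def classify(counts, features, alpha=1.0):
    vocab = {f for feats in counts.values() for f in feats}
    best_cat, best_score = None, None
    for category, feats in counts.items():
        total = sum(feats.values())
        # Log-domain, Laplace-smoothed multinomial score per category.
        score = sum(
            math.log((feats.get(f, 0) + alpha) / (total + alpha * len(vocab)))
            for f in features
        )
        if best_score is None or score > best_score:
            best_cat, best_score = category, score
    return best_cat

records = ["a1\tpython\ttech", "a1\tlambda\ttech", "a2\trecipe\tfood"]
model = train(records)
print(classify(model, ["python", "lambda"]))  # tech
```

Keeping the classifier that dumb and pushing all feature extraction upstream of the pipe is what makes the stemming, tokenization, and metadata experiments cheap.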

------
shearn89
Another useful toolkit for stuff like this (and anything to do with natural
language), though only available for Python at the moment, is the Natural
Language Toolkit: <http://code.google.com/p/nltk/>

It's a very powerful toolkit, with a lot more functionality than is needed to
write an NB classifier, but may be of interest to anyone looking at NLP!

------
Jach
This is a good explanation, though like the Dr. Dobbs article linked on this
page I prefer the exposition in logs. (Looks like the author updated for it.)
I also have a personal distaste for Venn-diagram, set-based, and event-based
presentations of probability, and for non-conditional probabilities rearing
their heads, but that attitude comes largely from reading Jaynes...

I wrote my own Naive Bayes function to help me tag my blog posts back in
January. I did a longish explanation and implementation (PHP); it'd be cool
if someone wanted to check my math/intuition, since while my button has
worked out so far (that is, no really surprising results have crept to the
top as most likely), I wouldn't be surprised if there's an error, or a
justification for a better calculation of a particular probability term that
I missed.
[http://www.thejach.com/view/2012/1/an_explanation_and_exampl...](http://www.thejach.com/view/2012/1/an_explanation_and_example_of_naive_bayes)

------
chrisacky
Really enjoyed that read. I'm looking at implementing some simple spam
detection for personal messages that are sent between users of my
application. (They are enquiries for sales, and users are billed per enquiry,
so it's important to make sure that they don't get billed for spam.) Do you
think Akismet would be suitable for this kind of thing? Any alternatives you
could recommend? I will probably get around to building my own at some point,
but as a proof of concept I'd like to just get something running.

~~~
EnderMB
A few months ago I tried to apply what is basically the same code (written in
C#) to Twitter to see if I could filter the spam messages I receive. It's a
great exercise for those interested in playing with the algorithm.

------
tmcw
Pretty good explanation - my only gripe is that all Bayes classifier
tutorials like this build the 'spam detector' type that's specialized to
text. Though it's a common use case, that isn't the only thing the classifier
can do - you can use it for raster classification, other kinds of prediction,
etc. - and building a non-specialized version would make this point clearer.

------
atrilla
In response to some of the comments:

In my experience, all variables are important, but at test time, using only
the variables (i.e., the words) actually observed in the text being
classified yields the best effectiveness rates, following the implementation
from Manning, Raghavan and Schütze (2008).

In addition, Laplace smoothing is of utmost importance for dealing with
out-of-vocabulary words, thus avoiding the annoying zeros.
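A small demonstration of why the smoothing matters, with made-up counts and an assumed vocabulary size:

```python
import math

# With raw maximum-likelihood estimates, a single out-of-vocabulary
# word gets P(word|class) = 0 and wipes out the whole score; log(0)
# isn't even defined. The counts below are hypothetical.

spam_counts = {"viagra": 30, "free": 20}
total_spam_words = 50
vocab_size = 1000  # assumed size of the combined vocabulary

def word_logprob(word, alpha):
    count = spam_counts.get(word, 0)
    return math.log((count + alpha) / (total_spam_words + alpha * vocab_size))

try:
    word_logprob("zebra", alpha=0)  # unseen word, no smoothing
except ValueError:
    print("unsmoothed estimate is zero; log blows up")

print(word_logprob("zebra", alpha=1))  # small but nonzero with Laplace
```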

My implementation:

[https://github.com/atrilla/nlptools/blob/master/core/classif...](https://github.com/atrilla/nlptools/blob/master/core/classification/MultinomialNaiveBayes.php)

------
lrvick
Here are the basics of putting together a Naive Bayes sentiment classifier
with NLTK: <https://gist.github.com/1266556>

Here is a full project I started to do the same, backed by Redis and built to
scale for large applications: <http://github.com/tawlk/synt>

------
redact207
One thing that is missing from this implementation is the use of lemmas.
Rather than treating words like "house", "houses", "housing" all as separate
terms, they all get reduced to the lemma "house". <http://lemmatise.ijs.si/>
is a good resource for this.

~~~
mmavnn
He covers this in some detail, under the term stemming (which I've heard more
often in this context). He even makes a library recommendation.

~~~
lignuist
Lemmatizing and stemming are similar, but not the same.

Example for possible results:

Lemmatizing: indices -> index

Stemming: indices -> indi

[http://en.wikipedia.org/wiki/Stemming#Lemmatisation_algorith...](http://en.wikipedia.org/wiki/Stemming#Lemmatisation_algorithms)

~~~
nl
It's worth noting that in this context stemming is probably the more
appropriate choice. Lemmatizing is unlikely to give better results, and in
most implementations you suffer a significant performance hit.

------
nailer
Thanks. It's refreshing to see algorithms written using real variable names
and your tutorial is made much better because of this.

------
jgmmo
Great article, all these Bayes articles have really piqued my interest in
Machine Learning. Love it! Good stuff!

------
B0rG
can you build one to dig for interesting stuff in:
<http://pastebin.com/D7sR4zhT> ?

~~~
robertskmiles
The algorithm needs training data. In this example you have a load of emails
where you've explicitly tagged each one as either spam or ham.

So you could perhaps do what you're suggesting, if you have a big database of
emails tagged as 'interesting' or 'not interesting'.

