The other thing is that simple thresholds aren't the only way to use the output of naive Bayes to decide whether a message is ham or spam. Back in 2007 I looked at calibrating the output, especially to deal with the problem of messages whose probability lands near 0.5: http://blog.jgc.org/2007/03/calibrating-machine-learning-bas...
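For the curious, the idea can be approximated with simple histogram binning (a sketch only - the post's actual method may differ): score a held-out set, bucket the raw scores, and replace each raw score with the empirical spam fraction of its bucket.

    def fit_calibration(scores, labels, n_bins=20):
        # scores: raw NB outputs in [0, 1]; labels: 1 = spam, 0 = ham
        bins = [[0, 0] for _ in range(n_bins)]  # [spam, total] per bucket
        for s, y in zip(scores, labels):
            i = min(int(s * n_bins), n_bins - 1)
            bins[i][0] += y
            bins[i][1] += 1
        # empirical spam fraction per bucket; empty buckets fall
        # back to the bucket midpoint
        return [spam / total if total else (i + 0.5) / n_bins
                for i, (spam, total) in enumerate(bins)]

    def calibrate(score, table):
        i = min(int(score * len(table)), len(table) - 1)
        return table[i]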
Also for spam filtering it's worth testing whether stop word removal and stemming are actually worthwhile: http://blog.jgc.org/2007/03/so-how-much-difference-do-stopwo...
Also on spam filtering: there's a note in the article about stemming not being optimal for spam, but the spam classifier was just an example; in general, stemming is still useful when bootstrapping with a small dataset.
P(spam|doc) = [P(w1|spam) * P(w2|spam) * ... * P(wn|spam)] / [P(w1) * P(w2) * ... * P(wn)]
I ignore P(spam) because, more often than not, I don't have a representative value for it. If I am using a training dataset, the P(spam) I'd estimate isn't the true base rate - it is just an artifact of whatever training data I used.
If I am given a value for it, I don't believe P(spam) is constant across all sorts of data, and I think using it skews the results. Also, if you assume P(spam) is constant, the ranking of scores remains the same when the constant is dropped. That won't give us the actual probability, but this is naive Bayes, which assumes w1 and w2 are independent when they are not. We don't get the actual probabilities in any case - and that isn't an issue, since we aren't after actual probabilities but a heuristic we can use to classify.
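To make that concrete, here is a minimal sketch of the prior-free score in Python (the dictionary names and the add-one smoothing are my assumptions, not the commenter's code):

    def spam_score(words, spam_counts, spam_total, word_counts, total):
        score = 1.0
        for w in words:
            p_w_spam = (spam_counts.get(w, 0) + 1) / (spam_total + 2)  # smoothed P(w|spam)
            p_w = (word_counts.get(w, 0) + 1) / (total + 2)            # smoothed P(w)
            score *= p_w_spam / p_w
        return score  # P(spam) deliberately dropped, as argued above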
I tried using the extended form of Bayes' theorem, but in my sample runs with cross-validation it gave me worse results. Now, as a rule of thumb, I stick to Peter Norvig's advice: "If your training dataset is relevant, use the simplest algorithm and it will work. If your training data isn't accurate, even the complicated algorithms won't buy you much." In my experience, I have always had better results with the short form of Bayes than with the extended form.
One missing aspect of this naive Bayes implementation is that it assumes all features are relevant. I have tried logistic regression with regularization, and it tended to give me the same or worse results. So for most classification cases naive Bayes is my go-to algorithm - I don't do any stemming, lemmatization, or feature selection. I run it, and if the results aren't relevant, I re-examine the dataset. If they still aren't relevant, I go for feature selection. If they still aren't relevant, well, I mostly give up.
> consider "house" and "housing" ... the former is used less frequently in a spammy context than the latter.
The point is that stemming is valuable unless P(spam|root) is substantially different from P(spam|derivedWord). But that data is available to you through Bayes. Has anyone tried a 'conditional stemming' system where the word is stemmed iff
|P(spam|root) - P(spam|derivedWord)| < someThreshold
? I wonder if that would improve classification accuracy. I suppose it might be a big performance hit on the stemming algorithm though.
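Something like this hypothetical sketch, where the p_spam lookup and the threshold are assumptions (it needs P(spam|token) estimated from training counts for both raw words and their stems):

    from nltk.stem import PorterStemmer  # any stemmer would do

    stemmer = PorterStemmer()

    def conditional_stem(word, p_spam, threshold=0.1):
        # p_spam is an assumed lookup: token -> P(spam|token)
        root = stemmer.stem(word)
        if root == word or word not in p_spam or root not in p_spam:
            return root  # no evidence of divergence: stem as usual
        if abs(p_spam[root] - p_spam[word]) < threshold:
            return root  # root and derived word behave alike: safe to merge
        return word      # they diverge: keep the unstemmed feature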
Yes, you are right - it's better to convert the whole thing to a sum of logs; otherwise you end up with floating-point underflow.
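For illustration, a minimal sketch of the log-space rewrite (the lookup table and the unseen-word floor are assumed, not from the article):

    import math

    def log_score(words, log_p, default=math.log(1e-7)):
        # sum of log-probabilities instead of a product of probabilities;
        # log_p is an assumed dict of per-word log P(w|class), with a
        # floor for unseen words
        return sum(log_p.get(w, default) for w in words)

    # classify by comparing log scores:
    # spam if log_score(words, log_p_spam) > log_score(words, log_p_ham)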
The article was already getting too long, but I'll add a note about it, because it is an important optimization that affects both speed and even correctness (given the underlying limitations of floating point).
UPDATE: note added.
In the context of this article and implementation, the speed difference between + and * is utterly irrelevant and unmeasurable.
Some resources and other reference implementations that were useful in building it:
http://www.gigamonkeys.com/book/practical-a-spam-filter.html - Seibel's "Practical Common Lisp" book has a very readable implementation
http://www.paulgraham.com/spam.html - pg's essay that revived interest in using NBC for spam classification
http://spambayes.svn.sourceforge.net/viewvc/spambayes/trunk/... - a Python implementation that's been out there in the real world for about 10 years
https://github.com/WinnowTag/winnow/blob/master/src/classifi... - a C implementation used in another news classifier
I did something similar using dbacl.
Unfortunately, I never got around to releasing the code. However, it worked alright. The main problem was in the hassle it took to continuously train and retrain the classifier as my interests changed.
This is the main difference between articles you actually want to read and spam: spam is pretty much always going to remain spam. Your interests are never really going to change to the point that something you thought was spam one day would no longer be spam the next. But with articles, what you might have found uninteresting one day may become interesting the next. And what you thought was fascinating one day may become boring the next.
Thus, the need for continual training and re-training for spam is much less than it is for news articles and the like.
Because of this, the classification/reclassification process for news articles should be made as simple, quick and painless as possible. Otherwise it'll just be too much trouble to keep doing it day after day, week after week, month after month, and year after year.
 - http://www.lbreyer.com/dbacl.html
True. This is a real problem. Much better results can be obtained by partitioning the space of articles and predicting on more focused subsets. The partition can either be manual ("these documents are from my RSS feeds about Python programming") or automatic, via some clustering algorithm.
Ideally you identify subsets where your interests are less transient.
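The automatic route might look like this sketch (assuming scikit-learn is available; k is an arbitrary choice):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    def partition(articles, k=10):
        # group articles by TF-IDF similarity, then train one
        # interesting/boring classifier per group
        vectors = TfidfVectorizer(stop_words="english").fit_transform(articles)
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(vectors)
        groups = {}
        for article, label in zip(articles, labels):
            groups.setdefault(label, []).append(article)
        return groups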
But the flipside of using a general purpose classifier is that you can train it to predict whatever you want. You're in the driver's seat. It doesn't have to be "interesting," it can be "most relevant to my primary area of research" or "most relevant to this book I'm writing" or "most relevant to my small business."
Communicating this capability to users is a whole 'nother deal, of course.
> I did something similar using dbacl.
I remember looking at dbacl. There's some great information in the writeups, but I was disappointed that it coupled feature extraction to classification. For instance, what if you wanted to use stemmed terms as features? How good is the word tokenization for other languages? (compare to Xapian's tokenizer) Can it do named entity identification and use those as features? Can you toss in other features derived from external metadata? Etc.
We ended up doing all of these things, but our classifier itself remained a simple unix tool. For training you piped in (item,feature,category) tsv records. For classification, you piped in (item,feature) and it output (item,category,probability).
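The training half of such a tool could be as small as this sketch (the names and the JSON model format are hypothetical, not the original tool's); the classification half would read (item, feature) pairs, group by item, and score each category from the stored counts:

    #!/usr/bin/env python
    import sys, json, collections

    def train(model_path):
        # reads (item, feature, category) TSV records from stdin
        counts = collections.defaultdict(lambda: collections.defaultdict(int))
        totals = collections.defaultdict(int)
        for line in sys.stdin:
            item, feature, category = line.rstrip("\n").split("\t")
            counts[category][feature] += 1
            totals[category] += 1
        with open(model_path, "w") as f:
            json.dump({"counts": counts, "totals": totals}, f)

    if __name__ == "__main__":
        train(sys.argv[1])  # e.g. cat train.tsv | ./nbtrain.py model.json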
It's a very powerful toolkit, with a lot more functionality than is needed to write an NB classifier, but it may be of interest to anyone looking at NLP!
I wrote my own naive Bayes function to help me tag my blog posts back in January. I did a longish explanation and implementation (PHP); it'd be cool if someone wanted to check my math/intuition. While my button has worked out so far (that is, no really surprising results have crept to the top as most likely), I wouldn't be surprised if there's an error, or a justification I missed for a better calculation of a particular probability term. http://www.thejach.com/view/2012/1/an_explanation_and_exampl...
In my experience all features are relevant, but at test time, using only the words actually observed in the document under test yields the best effectiveness rates, following the implementation from Manning, Raghavan and Schütze (2008).
In addition, the Laplace smoothing used here is of the utmost importance for dealing with out-of-vocabulary words, avoiding the annoying zero probabilities.
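Both points fit in one minimal multinomial-NB scoring sketch, in the style of Manning, Raghavan and Schütze (the parameter names are mine):

    import math

    def class_score(doc_words, class_counts, class_total, vocab_size, log_prior):
        s = log_prior
        for w in doc_words:  # only the words observed in the test document
            # Laplace (add-one) smoothing: unseen words get a small
            # nonzero probability instead of zeroing the whole score
            p = (class_counts.get(w, 0) + 1) / (class_total + vocab_size)
            s += math.log(p)
        return s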
Here is a full project I started to do the same, backed by Redis and built to scale for large applications: http://github.com/tawlk/synt
Example of possible results:
Lemmatizing: indices -> index
Stemming: indices -> indi
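To reproduce the comparison with NLTK, as a sketch (exact stemmer output varies by algorithm - NLTK's Porter stemmer, for instance, yields "indic" rather than "indi"):

    from nltk.stem import WordNetLemmatizer, PorterStemmer
    # requires nltk.download("wordnet") the first time

    print(WordNetLemmatizer().lemmatize("indices"))  # -> index
    print(PorterStemmer().stem("indices"))           # -> indic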
So you could perhaps do what you're suggesting, if you have a big database of emails tagged as 'interesting' or 'not interesting'.