How To Build a Naive Bayes Classifier 264 points by zhiping on Feb 27, 2012 | hide | past | web | favorite | 37 comments

 Years ago I wrote about this for Dr. Dobbs. That article is here: http://drdobbs.com/article/print?articleId=184406064&sit... The important differences are that the DDJ article used log probabilities instead of simple probabilities because underflow is a real problem.The other thing is that simple thresholds aren't the only solution to using the output of naive Bayes for determining whether a message is ham or spam. Back in 2007 I looked at calibrating the output to deal especially with the problem of messages that have a probability near 0.5: http://blog.jgc.org/2007/03/calibrating-machine-learning-bas...Also for spam filtering it's worth testing whether stop word removal and stemming are actually worthwhile: http://blog.jgc.org/2007/03/so-how-much-difference-do-stopwo...
 You are right about log probabilities ... I added a note about it (I'm the author of the article).Also on spam filtering, there's a note there about stemming not being optimal for spam, but the spam classifier was just an example, otherwise stemming is still useful when bootstrapping with a small dataset.
 On a related note, I generally simplify the implementation further:`````` P(spam/doc) = P(w1/spam) * P(w2/spam) * ... * P(wn/spam) ------------------------------------------- P(w1) * P(w2) * ... * P(wn) `````` Of course, I do the log and anti-log dance(if non-log values are required, otherwise leave it at the log probabilities), or else both denominator and numerator tend to 0.I ignore P(spam) because more often than not, I don't have the representative value for P(spam). If I am using a training dataset, P(spam) isn't chance - it is just the training data I used.If I am given a value for it, I don't believe P(spam) is a constant for all sorts of data, and I think using it skews the results. Also, if you assume P(spam) is constant, the scale remains the same if the constant is dropped. That won't give us the actual probability, but this is naive bayes which assumes w1 and w2 are independent, when they are not. I don't think we get the actual probabilities in any case - that isn't an issue as we aren't concerned with actual probabilities but a heuristic which we can use to classify.I tried using extended form of the bayes theorem, but in my sample runs using cross validation, it gave me worse results. Now, as a rule of thumb, I stick to Peter Norwig's advice - "If your training dataset is relevant, use the simplest algorithm and it will work. If your training data isn't accurate, even the complicated algorithms won't buy you much." In my experience, I always had better results with short bayes compared to extended bayes.One missing aspect about this naive bayes implementation is it assumes all features are relevant. I have tried logistic regressions with regularization, and they tended to give me same or worse results. So, for most of the classification cases, bayes is my goto algorithm - I don't do any stemming or lemmatization or feature selection. I run it, and if results aren't relevant, then I re-examine the dataset. If they still aren't relevant, I go for feature selection. If they still aren't relevant, well, I mostly give up.
 I found that note very interesting:> consider "house" and "housing" ... the former is used less frequently in a spammy context then the later.The point is that stemming is valuable unless P(spam|root) is substantially different from P(spam|derivedWord). But that data is available to you through Bayes. Has anyone tried a 'conditional stemming' system where the word is stemmed iff|P(spam|root) - P(spam|derivedWord)| < someThreshold? I wonder if that would improve classification accuracy. I suppose it might be a big performance hit on the stemming algorithm though.
 Interesting idea. I bet within some specific application domain, stemmed probabilities either diverge or don't across the board, so it wouldn't pay to conditionally stem term-by-term.
 Yeah, probably. That's something that's fairly easy to measure for any given data set, so you can check to see if it's worthwhile before you do it.
 I see you are the author of POPFile! I ran that for years to take care of my spam problem. It worked brilliantly. Thanks a ton. Too bad Gmail replaced you.
 Glad it worked well for you. It had its day and now, happily, the sort of machine learning that was in POPFile is everywhere.
 Nice article, but I suspect the implementation will not work. I did essentially the same for an AI-class exercise, and was thrilled to see that you could write a working Bayes classifier in 60 short lines of Python code. But later I landed a free-lance job that required writing a classifier that could be applied to real world data, and I soon realized that repeated multiplication of numbers between 0 and 1 sends you to zero too fast for the implementation to actually work. I might have missed it in the code, but I think he's doing the same mistake: you need to normalize or move to logarithms for the estimation of probabilities to work for medium or large datasets.
 I'm the author of the article ...Yes you are right, it's better to convert the whole thing to a sum of logs, otherwise you end up with floating-point underflow.The article was getting already too long however, but I'll add a note about it because it is an important optimization that affects both speed and even correctness (because of the underlying limitations of floating point).UPDATE: note added.
 Is the sum of logs method mathematically equivalent to the multiplication of probabilities (i.e. will it always produce a monotonic ordering of class predictions)?
 log(AB) = log(A) + log(B) for all A, B real numbers
 Only for A, B > 0, but that's good enough for probabilities (except 0, I guess)
 That was there to test you...
 Even better, convert everything to log(p/(1-p)) log-odds, and halve the number of things to sum.
 I think this should also have the added effect of being overall faster, as doing lots of addition would be quicker than doing lots of multiplication (log notwithstanding).
 That could be true in a hardcore numerical computation, assuming you know enough about your architecture and compiler to make use of it.In the context of this article and implementation, the speed difference between + and * is utterly irrelevant and unmeasurable.
 On Ruby you can use the Decimal class, although it's really slow
 We wrote a Naive Bayes Classifier for http://linktamer.com/ that learns what news articles you find interesting. It's in C and uses a cdb for the database of frequencies, so it's pretty darn fast. Maybe we'll throw it up on github one of these days.Some resources and other reference implementations that were useful in building it:http://www.gigamonkeys.com/book/practical-a-spam-filter.html - Siebel's "Practical Common Lisp" book has a very readable implementationhttp://www.paulgraham.com/spam.html - pg's essay that revived interest in using NBC for spam classificationhttp://spambayes.svn.sourceforge.net/viewvc/spambayes/trunk/... - a Python implementation that's been out there in the real world for about 10 yearshttps://github.com/WinnowTag/winnow/blob/master/src/classifi... - a C implementation used in another news classifier
 "We wrote a Naive Bayes Classifier for http://linktamer.com/ that learns what news articles you find interesting."I did something similar using dbacl.[1]Unfortunately, I never got around to releasing the code. However, it worked alright. The main problem was in the hassle it took to continuously train and retrain the classifier as my interests changed.This is the main difference between articles you actually want to read and spam: spam is pretty much always going to remain spam. Your interests are never really going to change to the point that something you thought was spam one day would no longer be spam the next. But with articles, what you might have found uninteresting one day may become interesting the next. And what you thought was fascinating one day may become boring the next.Thus, the need for continual training and re-training for spam is much less than it is for news articles and the like.Because of this, the classification/reclassification process for news articles should be made as simple, quick and painless as possible. Otherwise it'll just be too much trouble to keep doing it day after day, week after week, month after month, and year after year.
 But with articles, what you might have found uninteresting one day may become interesting the next. And what you thought was fascinating one day may become boring the next.True. This is a real problem. Much better results can be obtained by partitioning the space of articles and predicting on more focused subsets. The partition can either be manual ("these documents are from my RSS feeds about Python programming") or automatic, via some clustering algorithm.Ideally you identify subsets where your interests are less transient.But the flipside of using a general purpose classifier is that you can train it to predict whatever you want. You're in the driver's seat. It doesn't have to be "interesting," it can be "most relevant to my primary area of research" or "most relevant to this book I'm writing" or "most relevant to my small business."Communicating this capability to users is whole 'nother deal of course.I did something similar using dbacl.I remember looking at dbacl. There's some great information in the writeups, but I was disappointed that it coupled feature extraction to classification. For instance, what if you wanted to use stemmed terms as features? How good is the word tokenization for other languages? (compare to Xapian's tokenizer) Can it do named entity identification and use those as features? Can you toss in other features derived from external metadata? Etc.We ended up doing all of these things, but our classifier itself remained a simple unix tool. For training you piped in (item,feature,category) tsv records. For classification, you piped in (item,feature) and it output (item,category,probability).
 Another useful toolkit for stuff like this (and anything to do with natural language), but only available for python at the moment is the Natural Language Toolkit: http://code.google.com/p/nltk/It's a very powerfull toolkit, with a lot more functionality than is needed to write an NB classifier, but may be of interest to anyone looking at NLP!
 This is a good explanation, though like the Dr. Dobbs article linked on this page I prefer the exposition of logs. (Looks like the author updated for it.) I also have a personal displeasure of Venn diagram and set-based and event-based versions of probability, and non-conditional probabilities rearing their heads, but that attitude comes largely from reading Jaynes...I wrote my own Naive Bayes function for helping me tag my blog posts back in January. I did a longish explanation and implementation (PHP), it'd be cool if someone wanted to check my math/intuition since while my button has worked out so far (that is, no really surprising results have crept to the top as most likely) I wouldn't be surprised if there's an error or justification for a better calculation of a particular probability term that I missed. http://www.thejach.com/view/2012/1/an_explanation_and_exampl...
 Really enjoyed that read. I'm looking at implementing some simple spam detection for personal messages that are sent between users of my application. (They are enquiries for sales and users are billed per enquiry so it's important to make sure that they don't get billed for spam), do you think Akismet would be suitable for this kind of thing? Any alternatives that you could recommend, I will probably get around to building my own at some point, but as a proof of concept I'd like to just get something running.
 A few months ago I tried to apply what is basically the same code (written in C#) to Twitter to see if I could filter the spam messages I receive. It's a great exercise for those interested in playing with the algorithm.
 Pretty good explanation - my only gripe is that all bayes classifier tutorials like this build the 'spam detector' type that's specialized to text. Though it's a common use case, that isn't the only thing that the classifier can do - you can use it for raster classification, other types of predictions, etc: and building a non-specialized version would make this point more clear.
 In response to some of the comments:In my experience, all variables are important, but on testing, using only the ones (i.e., the words) observed in the text to test yields the best effectiveness rates, following the implementation from Manning, Raghavan and Schütze (2008).In addition, the considered Laplace smoothing is of utmost importance to deal with out-of-vocabulary words, thus avoiding the annoying 0's.My implementation:
 Here are the basics of putting together a Naive Bayes sentiment classifier with NLTK https://gist.github.com/1266556Here is a full project I started to to the same backed by redis adn built to scale for large applications: http://github.com/tawlk/synt
 One thing which is missing from this implementation is the use of lemmas. Rather than treating words like "house", "houses", "housing" all as separate terms, they all get reduced to the stem "house". http://lemmatise.ijs.si/ is a good resource for this.
 He covers this in some detail, under the term stemming (which I've heard more often in this context). He even makes a library recommendation.
 Lemmatizing and stemming are similar, but not the same.Example for possible results:Lemmatizing: indices -> indexStemming: indices -> indi
 It's probably worth noting that in this context stemming is probably most appropriate. Lemmatizing is unlikely to give better results and in most implementations you suffer a significant performance hit.
 Btw, in case you want a good lemmatisation implementation, there's one in NLTK that uses the WordNet database.
 Thanks. It's refreshing to see algorithms written using real variable names and your tutorial is made much better because of this.
 Great article, all these Bayes articles have really piqued my interest in Machine Learning. Love it! Good stuff!
 can you build one to dig for interesting stuff in: http://pastebin.com/D7sR4zhT ?
 The algorithm needs training data. In this example you have a load of emails where you've explicitly tagged each one as either spam or ham.So you could perhaps do what you're suggesting, if you have a big database of emails tagged as 'interesting' or 'not interesting'.

Search: