

Naive Bayes and Text Classification I – Introduction and Theory - rasbt
http://sebastianraschka.com/Articles/2014_naive_bayes_1.html

======
syllogism
> Naive Bayes classifiers, a family of classifiers that are based on the
> popular Bayes’ probability theorem, are known for creating simple yet well
> performing models, especially in the fields of document classification and
> disease prediction.

But this is simply not true! They _don't_ perform well. There's really no
reason to teach people Naive Bayes any more, except as a footnote when
explaining log-linear/MaxEnt models.

MaxEnt is not so complicated, and it makes Naive Bayes fully obsolete. And if
MaxEnt is in some way too complicated/expensive, Averaged Perceptron is
generally much better than NB, can be implemented in 50 lines of Python, and
has far fewer hyper-parameters.
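
For reference, a minimal sketch of that fifty-lines claim: a binary averaged
perceptron over sparse feature dicts, with lazy averaging. All names here are
illustrative, not from any particular library.

    from collections import defaultdict

    class AveragedPerceptron:
        """Binary averaged perceptron over sparse feature dicts.

        Averaging the weight vector over all update steps damps the
        oscillation of the plain perceptron and usually improves accuracy.
        """
        def __init__(self):
            self.weights = defaultdict(float)  # current weight vector
            self.totals = defaultdict(float)   # accumulated weight * duration
            self.stamps = defaultdict(int)     # step of each weight's last update
            self.step = 0

        def score(self, features):
            return sum(self.weights[f] * v for f, v in features.items())

        def update(self, features, label):
            # label is +1 or -1; only mistakes trigger a weight change
            self.step += 1
            if label * self.score(features) <= 0:
                for f, v in features.items():
                    # lazily roll the running total forward before updating
                    self.totals[f] += (self.step - self.stamps[f]) * self.weights[f]
                    self.stamps[f] = self.step
                    self.weights[f] += label * v

        def finalize(self):
            # replace each weight with its average over all steps seen
            for f in self.weights:
                self.totals[f] += (self.step - self.stamps[f]) * self.weights[f]
                self.weights[f] = self.totals[f] / max(self.step, 1)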

A common way for machine learning courses to suck is to teach students about a
bunch of crap, obsolete algorithms they should never use, simply for
historical reasons --- they used to be in the course, so they stay in the
course.

~~~
Dn_Ab
As others have mentioned, when the assumptions of conditional independence are
met, Gaussian Naive Bayes and MaxEnt will asymptotically learn the same
classifier, except that NB reaches it with O(log d) examples instead of O(d),
where d is the number of dimensions. So in those cases, and when you have few
examples, you'll want to use NB.

Even when the independence assumptions are not met, NB will often produce a
good classifier if you don't care about the accuracy of the predicted
probabilities. NB is also online, resistant to the curse of dimensionality,
and for categorical data it will learn polynomial decision boundaries. So the
gain in the trade-off is ease of implementation: you can write one in a dozen
lines or less (see the sketch below), throw something close to your raw data
at it, and often get good-enough results without any up-front tuning, kernels,
or regularization.
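
A minimal sketch of that dozen-line claim, assuming a multinomial model with
add-one smoothing over tokenized documents (all names are illustrative):

    import math
    from collections import Counter, defaultdict

    def train_nb(docs):
        """docs: iterable of (label, token_list) pairs."""
        class_counts, word_counts, vocab = Counter(), defaultdict(Counter), set()
        for label, tokens in docs:
            class_counts[label] += 1
            word_counts[label].update(tokens)
            vocab.update(tokens)
        total = sum(class_counts.values())

        def predict(tokens):
            def logp(label):
                n = sum(word_counts[label].values())
                # log prior + smoothed log likelihood of each token
                return math.log(class_counts[label] / total) + sum(
                    math.log((word_counts[label][t] + 1) / (n + len(vocab)))
                    for t in tokens)
            return max(class_counts, key=logp)

        return predict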

You are right, though, that MaxEnt (some sparsity-capturing variants are as
powerful as SVMs) will in general outperform Naive Bayes, and that the
averaged perceptron probably has, _on average_, the best performance-to-ease-
of-implementation ratio. But for many implementers and problems, where clarity
is paramount, those distinctions will not be worth the cost.

Also, you have a wonderful website.

~~~
instakill
What's his website? It's not in his bio.

~~~
andreasvc
[http://honnibal.wordpress.com/](http://honnibal.wordpress.com/)

------
rasbt
I posted this article a while ago, and I thought it might not be a bad idea
to archive it on arXiv. What do you think? The (LaTeX) PDF version would look
like this:
[http://sebastianraschka.com/PDFs/articles/naive_bayes_1.pdf](http://sebastianraschka.com/PDFs/articles/naive_bayes_1.pdf)
I just signed up on arXiv and see that the catch is that I'd need 2
recommendations for the categories Computer Science -> Learning or Statistics
-> Machine Learning. Would anyone here be willing to give me a recommendation?
I would really appreciate it!

------
dj-wonk
Re: "Empirical studies showed that the mutli-variate Bernoulli model is
inferior to the multinomial Bayes model, and the latter has been shown to
reduce the error by 27% on average [13]."

I remain skeptical of statements that claim that one kind of model is
categorically inferior. Also, I have a hard time believing this is a fair
summary of the literature. First, one study is cited, but the sentence implies
many.

Second, the citation in the original post is not a fair summary of the cited
research. The abstract of "A Comparison of Event Models for Naive Bayes Text
Classification" says: "This paper aims to clarify the confusion by describing
the differences and details of these two models, and by empirically comparing
their classification performance on five text corpora. We find that the multi-
variate Bernoulli performs well with small vocabulary sizes, but that the
multinomial usually performs even better at larger vocabulary
sizes—providing on average a 27% reduction in error over the multi-variate
Bernoulli model at any vocabulary size."

~~~
rasbt
Thanks for the comment, and I agree! I rephrased it a little bit to

"Empirical comparisons provide evidence that the multinomial model tends to
outperform the multi-variate Bernoulli model if the vocabulary size is
relatively large [13]. However, the performance of machine learning algorithms
is highly dependent on the appropriate choice of features. In the case of
naive Bayes classifiers and text classification, large differences in
performance can be attributed to the choices of stop word removal, stemming,
and token-length [14]. In practice, it is recommended that the choice between
a multi-variate Bernoulli or multinomial model for text classification be
preceded by comparative studies covering different combinations of feature
extraction and selection steps."
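
The kind of comparative study described there is easy to run; a sketch with
scikit-learn, assuming `data` is a hypothetical list of (text, label) pairs:

    # Cross the two event models with one feature-extraction choice
    # (vocabulary size). BernoulliNB binarizes the counts by default.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB, BernoulliNB
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    texts, labels = map(list, zip(*data))
    for max_features in (100, 1000, 10000):
        for model in (MultinomialNB(), BernoulliNB()):
            pipe = make_pipeline(
                CountVectorizer(stop_words='english', max_features=max_features),
                model)
            scores = cross_val_score(pipe, texts, labels, cv=5)
            print(max_features, type(model).__name__, round(scores.mean(), 3))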

~~~
dj-wonk
Thanks for the updates.

------
MojoJolo
I actually really like doing text classification with Naive Bayes. I'm still
new to it and still learning a lot. But one thing I'm having a hard time with
is explaining Naive Bayes classification in simple terms.

If you were asked, how would you explain Naive Bayes classification in simple
terms?

~~~
yen223
I know a lot of spam has the word "Viagra", and very few legitimate emails
have the word "Viagra". So if an email contains the word "Viagra", it's more
likely to be spam. Naive Bayes classification just applies that to all words
in all your emails.
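
The arithmetic behind that intuition is just Bayes' rule; a toy example with
made-up counts:

    # Made-up numbers: of 100 emails, 40 are spam and 30 of those
    # contain "viagra"; of the 60 legitimate emails, only 1 does.
    p_spam = 40 / 100                # prior P(spam)
    p_word_spam = 30 / 40            # P("viagra" | spam)
    p_word_ham = 1 / 60              # P("viagra" | not spam)

    # Bayes' rule: P(spam | "viagra")
    p_word = p_word_spam * p_spam + p_word_ham * (1 - p_spam)
    print(p_word_spam * p_spam / p_word)   # ~0.97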

~~~
rasbt
I like this simple and intuitive explanation. Hm, maybe a short mention of
the "a priori" knowledge would also be worthwhile!

------
meaty_sausages
Lexicon-based approaches are better than all of these. The difference between
MaxEnt and NB is similar to (the same as?) the difference between a binomial
and a multinomial regression. They have all of the drawbacks and only a few of
the benefits. A decent word list would be about as accurate as a regression
(70% or so); account for syntax and it will get you higher.
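
To make "lexicon-based" concrete: the simplest version is just a word-list
lookup. A toy sketch with a made-up lexicon (real word lists run to thousands
of entries):

    POSITIVE = {"good", "great", "excellent", "love"}
    NEGATIVE = {"bad", "awful", "terrible", "hate"}

    def classify(text):
        tokens = text.lower().split()
        score = (sum(t in POSITIVE for t in tokens)
                 - sum(t in NEGATIVE for t in tokens))
        return "positive" if score >= 0 else "negative"

    print(classify("What a great movie, I love it"))   # positive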

------
dj-wonk
There are typos:

2.6.3. The correct term is "Additive Smoothing"
[https://en.wikipedia.org/wiki/Additive_smoothing](https://en.wikipedia.org/wiki/Additive_smoothing)
not "smoothening".

3.3. Letters transposed. The author meant "Multi-variate".

~~~
rasbt
Thanks a lot, I always called it "smoothening" for whatever reason, but it
should be fixed now :)

------
en4bz
I'm currently working on a school project to classify abstracts from papers
into 4 categories. Currently, Bernoulli Naive Bayes scores 85% accuracy. For
comparison, kNN scores about the same.

~~~
rasbt
Sounds cool. I think the most crucial part is really the data (or feature)
set. The choice of classifier makes a difference too, but having good data
(salient, invariant, discriminatory) makes the largest difference.

One thing about Naive Bayes and kNN: although the performance may be similar,
I would consider Naive Bayes beneficial if the data set is large. kNN has to
re-evaluate the whole training set for every prediction, whereas Naive Bayes
makes classification computationally cheap, and with "on-line" learning you
can still improve or adjust the model over time.
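
That "on-line" point can be made concrete with scikit-learn's partial_fit; a
sketch assuming `batches` is a hypothetical iterator of (texts, labels)
chunks:

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.naive_bayes import MultinomialNB

    vec = HashingVectorizer(alternate_sign=False)  # non-negative features for NB
    clf = MultinomialNB()
    classes = ["spam", "ham"]  # all labels must be declared on the first call

    for i, (texts, labels) in enumerate(batches):
        X = vec.transform(texts)  # stateless, so safe to apply per batch
        clf.partial_fit(X, labels, classes=classes if i == 0 else None)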

