
A practical explanation of a Naive Bayes classifier - feconroses
https://monkeylearn.com/blog/practical-explanation-naive-bayes-classifier/
======
moultano
A practical issue for Naive Bayes that also infects linear models is bias
w.r.t. document length. Typically when you are detecting a rare, relatively
compact class such as sports articles (or spam) you will tend to have a
strongly negative prior, many positive features, and few negative ones. As a
consequence, as the length of your text increases, not only does the variance
of your prediction increase, but the mean tends to as well. This leads to all
very long documents being classified as positive, regardless of their text.
You can observe this by training your model and then classifying
/usr/dict/words.
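
A toy simulation of that drift (all numbers here are invented, not from a trained model): draw per-token log-odds with a slightly positive mean, add a strong negative prior, and watch the mean score grow with document length.

```python
import random

random.seed(0)

# Invented per-token log-odds: for a rare positive class, informative
# features skew positive, so give the scores a small positive mean.
token_scores = [random.gauss(0.05, 1.0) for _ in range(5000)]

def classify_score(doc_len, prior=-5.0):
    """Strong negative prior plus one sampled token score per token."""
    return prior + sum(random.choice(token_scores) for _ in range(doc_len))

# Average score over 100 random "documents" per length: the mean drifts
# upward with length, so very long documents all end up positive.
for n in (10, 100, 1000, 10000):
    mean = sum(classify_score(n) for _ in range(100)) / 100
    print(n, round(mean, 1))
```

Short documents stay dominated by the negative prior; long ones are carried positive by the accumulating per-token scores, which is exactly the /usr/dict/words failure mode.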

This is the most common mistake I've seen in production use of linear models
on document text. Invariably, they'll misfire on any unusually long document.

~~~
intune
Is there some way to normalize the document length?

~~~
moultano
Lots of reasonable hacks.

1\. Use only the beginning of the document, as that's probably the most
important part anyways, and it's fast.

2\. Divide the sum of your feature scores by sqrt(n) to give it constant
variance, and hopefully keep it comparable with your prior.

3\. Split the doc into reasonably sized chunks, and average their scores
rather than adding them.
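
Hacks 2 and 3 as a sketch (`feature_scores` and the prior here are placeholders, not a real trained model):

```python
import math

def score_sqrt(tokens, feature_scores, prior):
    """Hack 2: divide the summed feature scores by sqrt(n) so their
    variance stays roughly constant as the document grows."""
    s = sum(feature_scores.get(t, 0.0) for t in tokens)
    return prior + s / math.sqrt(max(len(tokens), 1))

def score_chunked(tokens, feature_scores, prior, chunk=200):
    """Hack 3: score fixed-size chunks and average the chunk scores
    instead of summing over the whole document."""
    chunks = [tokens[i:i + chunk] for i in range(0, len(tokens), chunk)] or [[]]
    per_chunk = [sum(feature_scores.get(t, 0.0) for t in c) for c in chunks]
    return prior + sum(per_chunk) / len(per_chunk)
```

Either way the prior stays on a comparable scale no matter how long the document gets.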

~~~
geezerjay
> 1\. Use only the beginning of the document, as that's probably the most
> important part anyways, and it's fast.

That seems to be a solution devised for news articles, as the standard news
writing style involves providing answers to the Five Ws up front in the
article.

------
superasn
I created a small program that finds the best sub-reddit given any title
text[1] using this algorithm.

I'm a total ML noob but it was an interesting project and the results were
pretty accurate.

I basically used reddit's BigQuery data for the dataset (it's huge!). If you
need a practical example of this algo, the algorithm and code are here[2].

[1]
[https://storage.googleapis.com/superasn/script.html](https://storage.googleapis.com/superasn/script.html)

[2]
[https://www.reddit.com/r/learnmachinelearning/comments/6hqd6...](https://www.reddit.com/r/learnmachinelearning/comments/6hqd6o/p_automatic_reddit_categorizer_update_first/)

------
mooman219
The examples other people are using are fairly narrow. I would like to
add that text categorization via a Naive Bayes classifier is surprisingly
accurate and simple. This paper[1] uses n-grams and a simple out-of-place
measure to compare articles against different verticals and often sees
greater than 99% accuracy for relatively small blocks of text. The
out-of-place measure also adds a penalty for features not found in the
document, which helps establish the individuality of the category
classification. Raw matching performance is also fairly impressive, and a
less naive implementation is highly parallelizable.

[1]
[http://odur.let.rug.nl/~vannoord/TextCat/textcat.pdf](http://odur.let.rug.nl/~vannoord/TextCat/textcat.pdf)

~~~
abhgh
From experience, I suspect NB does well on text primarily for two reasons:

(1) High dimensionality - this is also the reason why SVMs with linear kernels
do so well on text (NB finds a linear boundary too, just a different linear
boundary than a linear-kernel SVM).

(2) This reason applies to certain cases only: text where "token"-based
estimates are sufficiently discriminatory. For example, take a dataset that
has two kinds of documents, one talking about product A and the other talking
about product B, and you want to label the documents based on which product
they're talking about. Here, just noticing whether the term "A" or "B" appears
in the document is good enough for classification. You don't need powerful
models that infer the "connotations" of words from context. NB will do well
here, especially if bundled with a feature selection technique that weeds out
noisy features (like something based on Normalized Mutual Information)

------
schuetze
Considering the relative ease of implementation, classification accuracy with
smaller datasets, and computational efficiency of Naive Bayes classifiers, I
am surprised that they are not mentioned as often as other machine learning
competitors, such as random forest.

Are there major drawbacks to Naive Bayes classifiers? Is it just that they
aren't as accurate on large datasets?

~~~
j2kun
Boosted decision trees account for >50% of all Kaggle winners. That is the
real surprise, since people rarely talk about boosted decision trees.

~~~
abhgh
Deep Learning is in vogue, so I think we _see_ a lot written about it in
blogs, media, etc. GBDTs are widely used in industry though.

------
rgarreta
I would add that another practical aspect about Naive Bayes classifiers is
that you can make use of the conditional probabilities for each feature that
contributes to the predictions. That gives you some introspection on how the
model is working and it's useful when "debugging" classifiers by finding
features that should/shouldn't be used.

[https://monkeylearn.com/blog/how-to-create-text-classifiers-...](https://monkeylearn.com/blog/how-to-create-text-classifiers-machine-learning/#keyword-cloud)

~~~
abhgh
A side note here: the classification probabilities NB produces are not very
accurate and usually need some correction using a process called
"calibration".

------
mattbettinson
[https://github.com/bettinson/bayesian-tag-suggestion](https://github.com/bettinson/bayesian-tag-suggestion)

I wrote one of these in Ruby to classify links into tags! Was fun. I think it
got me a job.

------
torbjorn
I just did a Jupyter notebook on Naive Bayes for Siraj Raval's Math of
Intelligence YouTube course.

[https://github.com/NoahLidell/math-of-intelligence/blob/mast...](https://github.com/NoahLidell/math-of-intelligence/blob/master/probability_theory/bayesian-classification.ipynb)

I used naive bayes to classify raps from Biggie and 2pac.

------
b_ttercup
Is Naive Bayes ever really the most practical choice? Yes, it is a simple,
fast algorithm, but in my experience it's usually a non-trivial step below
other simple models and doesn't seem to show any major advantages. The results
shown here seem good, but bag-of-words models usually do better than you might
think on supervised NLP. So what's the motivation?

~~~
Houshalter
The scikit-learn flowchart recommends it for text data with less than 100k
samples when linear SVC doesn't work: [http://scikit-learn.org/stable/tutorial/machine_learning_map...](http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)

AFAIK it's by far the fastest machine learning method and one of the only ones
that can be learned "online", i.e. it can just update the model each time it
gets a datapoint, then throw the datapoint away without saving it for future
training. These are nice properties if you are doing something at a very large
scale or in an environment with very limited resources.
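
A sketch of that online updating with scikit-learn's `MultinomialNB.partial_fit` (the mini spam/ham batches are invented):

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# A hashing vectorizer is stateless, so no vocabulary needs to be kept
# between batches; NB needs non-negative counts, hence alternate_sign=False.
vec = HashingVectorizer(n_features=2**16, alternate_sign=False)
clf = MultinomialNB()

batches = [
    (["cheap pills buy now", "meeting moved to friday"], ["spam", "ham"]),
    (["you won a prize click here", "lunch tomorrow?"], ["spam", "ham"]),
]
for docs, labels in batches:
    # The full label set must be declared on the first partial_fit call;
    # each batch can be thrown away once it has been absorbed.
    clf.partial_fit(vec.transform(docs), labels, classes=["ham", "spam"])

print(clf.predict(vec.transform(["free prize pills"])))
```

Each call folds a batch into the running class and token counts and nothing else is retained, which is what makes the one-pass, constant-memory property possible.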

And if your data happens to actually meet the naive bayes assumptions (that
all the features are conditionally independent) then it's literally
mathematically optimal and you can't do any better than it. It seems to work
fairly well even when that isn't the case though.

~~~
phunge
Logistic regression can easily be made online too, keep in mind! sklearn has
an implementation of online gradient descent, and Vowpal Wabbit is also
excellent at these problems.

Naive Bayes can be parallelized in ways that SGD can't, but that's a whole
other conversation.

~~~
Houshalter
Gradient descent can be made online, but it's very slow and suffers from
catastrophic forgetting. Typical gradient descent needs to iterate over the
dataset many times, while Naive Bayes needs only one pass.

------
anthonysarkis
C++ implementation for a self-driving car (school project):
[https://github.com/swirlingsand/self-driving-car-nanodegree-...](https://github.com/swirlingsand/self-driving-car-nanodegree-nd013/blob/master/p11/naive_bayes_cpp/classifier.cpp)

------
drefgert
Bayes does a great job of filtering spam, but it drives me nuts that _obvious_
spam still appears in my inbox.

Suffix trees would fix this in most cases, so why the heck isn't spam
filtering using them to remove the obvious spam?

