
Bayes Theorem and Naive Bayes - alexhwoods
http://alexhwoods.com/2015/11/08/bayes-theorem-and-naive-bayes/
======
joe_the_user
I've heard Naive Bayes described as "nearly good enough" for many uses. Can
anyone quantify how much worse it is for various applications?

~~~
tunesmith
Naive Bayes assumes variable independence, for one thing. If the features
aren't independent of each other, then you'd be better off using a
probabilistic graphical model.

~~~
ced
> Naive Bayes assumes variable independence, for one thing

That's true, but every model makes assumptions that are wrong. The puzzling
thing about Naive Bayes is how well it performs in practice in spite of its
assumptions being so wrong. I believe there have been papers explaining this;
I would look at Russell and Norvig's book for a start.

One thing that Naive Bayes sucks at is providing good probability estimates.
It is nearly always overconfident in its predictions (e.g. P(rain) =
0.99999999), even though its classification accuracy can be pretty good
(relative to its simplicity). Logistic regression fares a lot better for
probabilities.
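
You can see the overconfidence directly when features are redundant. A rough
sketch, assuming scikit-learn is available; the data is synthetic, ten
near-copies of one informative column:

    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, 2000)
    base = y + rng.normal(0, 1, 2000)
    # Ten near-identical columns: NB treats each one as independent evidence
    # and multiplies their likelihoods, so its probabilities pile up near 0/1.
    X = np.column_stack([base + rng.normal(0, 0.01, 2000) for _ in range(10)])

    nb = GaussianNB().fit(X, y)
    lr = LogisticRegression(max_iter=1000).fit(X, y)
    print("NB max P(class 1):", nb.predict_proba(X)[:, 1].max())
    print("LR max P(class 1):", lr.predict_proba(X)[:, 1].max())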

~~~
_dps
> One thing that Naive Bayes sucks at is providing good probabilistic
> estimates.

This is true based on how it's described in textbooks, but in practice it
should always be combined with a calibration algorithm (which adds a trivial
O(n) cost to the process). The common choices here are Platt scaling [0] or
isotonic regression [1] (the latter should, itself, be combined with some
regularization because it can easily overfit to outliers).

With a calibration step in place, Naive Bayes produces probability estimates
every bit as reasonable as any other algorithm's.
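
For what it's worth, with scikit-learn (an assumption about the stack) the
calibration wrapper is a couple of lines; method="sigmoid" is Platt scaling,
"isotonic" is isotonic regression, and X_train / y_train / X_test are
placeholders for whatever data you have:

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.naive_bayes import GaussianNB

    # Fit NB as usual, then learn a monotone mapping from its raw scores
    # to calibrated probabilities on held-out folds.
    model = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5)
    model.fit(X_train, y_train)            # X_train, y_train: your data
    probs = model.predict_proba(X_test)    # usable as probabilities now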

[0]
[https://en.wikipedia.org/wiki/Platt_scaling](https://en.wikipedia.org/wiki/Platt_scaling)

[1]
[https://en.wikipedia.org/wiki/Isotonic_regression](https://en.wikipedia.org/wiki/Isotonic_regression)

~~~
otabdeveloper
> but in practice it should always be combined with a calibration algorithm
> (which adds a trivial O(n) cost to the process).

Why not just use logistic regression at this point? The only benefit of Naive
Bayes over logistic regression is that Naive Bayes is simpler to code.

~~~
_dps
The calibration cost is trivial compared to the coefficient learning cost.
Very roughly, calibration is O(records) whereas coefficient learning is
O(records * features). So the tiny add-on cost of calibration shouldn't affect
anyone's evaluation of the relative merits of algorithms. NB still retains its
computational advantage.

One thing that is often discounted in theoretical discussions is that NB takes
_much_ less I/O than something like LR, typically in the range of 5-100x
(depending on how many iterations you want to do updating your LR
coefficients). If you're doing, for example, a MapReduce implementation then
NB has huge computational advantages. In LR each coefficient update costs you
another map/reduce pass across your entire data set (whereas NB is always done
in exactly one iteration).
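
To make the "one iteration" point concrete: NB's sufficient statistics are
just counts, so training is a single streaming pass (or one map/reduce round
of emitting and summing counts). A rough sketch, where stream_of_records is a
stand-in for whatever your data source is:

    from collections import Counter

    class_counts, feature_counts = Counter(), Counter()
    for features, label in stream_of_records():   # stand-in data source
        class_counts[label] += 1
        for f in features:
            feature_counts[(label, f)] += 1
    # The per-feature likelihoods fall straight out of these counts (plus
    # smoothing); no further passes over the data, unlike iterative LR updates.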

So if NB + calibration gets you something close to LR for vastly less
computation and I/O, why _wouldn't_ you use it?

Having said that, if you're talking about small amounts of data that fit into
RAM and you can "just load into R", then sure, use LR over NB. For that
matter, use a Random Forest [0]. The reason NB is still around is that it
offers a point in the design space where you spend almost no resources and
still get something surprisingly useful (and recalibration narrows the utility
gap between NB and better methods even more).

[0] And you should _still_ consider calibrating your random forest's output.

------
Supersaiyan_IV
If I understood this correctly, Naive Bayes uses "reasoning on the average".
I'm assuming this is good for fast, rough estimates at best.

~~~
otabdeveloper
No, not quite. _All_ classification algorithms use "reasoning on the average".
(A.k.a. "statistics".)

Naive Bayes is a poor classifier because it ignores conditional dependencies:
cases where having feature A alone raises the odds and having feature B alone
raises the odds, but having features A and B together lowers the odds.
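
A rough sketch of that failure mode, assuming scikit-learn, on synthetic data
where A alone and B alone each raise the odds of class 1 but A and B together
lower them; NB multiplies the two "raises" together and badly overestimates:

    import numpy as np
    from sklearn.naive_bayes import BernoulliNB

    rng = np.random.default_rng(0)
    combos = [(0, 0), (1, 0), (0, 1), (1, 1)]
    combo_probs = [0.4, 0.25, 0.25, 0.1]                         # P(A, B)
    p_y1 = {(0, 0): 0.3, (1, 0): 0.9, (0, 1): 0.9, (1, 1): 0.1}  # P(y=1 | A, B)

    idx = rng.choice(4, size=20000, p=combo_probs)
    X = np.array([combos[i] for i in idx])
    y = (rng.random(20000) < np.array([p_y1[combos[i]] for i in idx])).astype(int)

    nb = BernoulliNB().fit(X, y)
    both = (X[:, 0] == 1) & (X[:, 1] == 1)
    print("NB   P(y=1 | A=1, B=1):", nb.predict_proba([[1, 1]])[0, 1])  # ~0.75
    print("true P(y=1 | A=1, B=1):", y[both].mean())                    # ~0.1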

~~~
nopinsight
Known dependencies can be taken into account by treating the joint A&B
occurrence as another feature. The result can sometimes be significantly
improved with this simple hack.

This is an example of why feature engineering is very important, at times more
so than the algorithm choice.

Learning to develop features that go together well with an algorithm is
essential to practical machine learning.

------
outlace
Good post in terms of content; really wish he had a nice syntax-highlighted
code box instead of screenshots of code

------
elliott34
The last two screen shots are the same

