
Naïve Bayes for Machine Learning - ReDeiPirati
https://blog.floydhub.com/naive-bayes-for-machine-learning/
======
jacquesm
If you prefer code over text:

    
    
      /* tp = true positive rate P(fires|H); fp = false positive rate P(fires|!H) */
      typedef struct {
          float tp;
          float fp;
      } BAYES_TEST;
    
      /* one Bayesian update: posterior from the prior and a single test result */
      float bayes(float prior, BAYES_TEST *test, int result) {
          if (result) {
              return prior * test->tp / (prior * test->tp + (1 - prior) * test->fp);
          }
          return prior * (1 - test->tp) / (prior * (1 - test->tp) + (1 - prior) * (1 - test->fp));
      }
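
For reference, with prior = P(H), tp = P(E|H) and fp = P(E|~H), the
positive-result branch above is just Bayes' rule:

    P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E \mid H)\,P(H) + P(E \mid \neg H)\,(1 - P(H))}

and the other branch is the same formula with the complements (1 - tp)
and (1 - fp) substituted, for a test that came back negative.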
    

This is basically the core of anything that uses 'Naive Bayes'; that's
really all there is to it. Note that Naive Bayes tends to 'clamp' to 0
or 1 after enough evidence has been processed: in the vast majority of
cases you're not going to end up with a value somewhere in the middle.
Also, when it doesn't work there won't be any hint that you are
mis-classifying; because of the above-mentioned property, the posterior
returned is not going to help much in terms of determining your
confidence level. Note too that if you evaluate a lot of evidence
sequentially (feeding each posterior back in as the next prior, as in
the sketch below) and the criteria are not 'independent', then you
won't get good results. Independent criteria can vary _independently_
from each other, so for instance if you base three of your criteria on
someone's IP address you won't get much mileage out of the second and
third, and you're going to over-represent that factor in the weighing
of the evidence.
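
To make the sequential part concrete, a minimal usage sketch; the
rates and the run_test() stub are made up for illustration:

    /* chain updates: each posterior becomes the next prior */
    static int run_test(int i) { (void)i; return 1; }  /* hypothetical stub */
    
    static float classify_example(void) {
        BAYES_TEST tests[] = {
            { 0.90f, 0.05f },   /* fires on 90% of positives, 5% of negatives */
            { 0.60f, 0.02f },
        };
        float p = 0.5f;         /* neutral starting prior */
        for (int i = 0; i < 2; i++) {
            p = bayes(p, &tests[i], run_test(i));
        }
        return p;   /* ends up very close to 0 or 1 once enough tests run */
    }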

Still, it is super easy to get up and running, will work with
remarkably little data to train with (determine those false positive
and true positive ratios) and runs very fast during classification. It
also requires very little in terms of hardware (no GPU or anything
like that).

Yes, you can improve on this, but it isn't always worth the effort or
the resources; I've built more than one revenue-generating tool with
this.

~~~
krackers
What are test->tp and test->fp here?

Edit: Ah, I'm guessing false positive and true positive rates?

~~~
jacquesm
Yep, sorry, that could have been made clearer. I typically have all the
tests in an array of structs; that way I can add a pointer to a
function to actually execute the test, and add a name to it for
debugging purposes.

This makes adding tests very easy.
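
Roughly like this; the EMAIL type, the individual tests, and the tp/fp
numbers below are hypothetical, just to show the shape of it:

    #include <string.h>
    
    /* hypothetical message type, just enough for the example tests */
    typedef struct { const char *subject; const char *sender_ip; } EMAIL;
    
    typedef struct {
        const char *name;             /* shows up in debug output    */
        int (*run)(const EMAIL *);    /* returns 1 if the test fires */
        float tp;                     /* P(test fires | spam)        */
        float fp;                     /* P(test fires | not spam)    */
    } CLASSIFIER_TEST;
    
    static int subject_has_free(const EMAIL *m) {
        return strstr(m->subject, "FREE") != NULL;
    }
    static int sender_ip_suspect(const EMAIL *m) {
        return strncmp(m->sender_ip, "203.0.113.", 10) == 0;
    }
    
    static CLASSIFIER_TEST tests[] = {
        { "subject_has_free",  subject_has_free,  0.80f, 0.05f },
        { "sender_ip_suspect", sender_ip_suspect, 0.60f, 0.02f },
    };

Each entry's tp/fp pair feeds straight into a bayes()-style update
loop, and the name field gives you something readable to log when a
classification goes wrong.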

~~~
zitterbewegung
Is this C or Go code?

~~~
ethhics
C, given that the types are before the variables, and the function isn’t
declared with func.

------
hakcermani
What a beautiful article. I have read Bayes many times, and each time,
just when I thought I had it, it slipped away. This article does a
beautiful job of explaining Bayes in simple terms and its application
in classification. Thank you!

~~~
windsignaling
Thumbs down for me.

How can someone write that much text, and then all they end up doing is
importing scikit-learn?

I can figure out how to do that myself in 2 minutes.

The real understanding of machine learning comes when you start implementing
the models yourself.

~~~
jackhack
Bayes is very simple, easily understood by a 6-year-old if explained
properly. There are scary-looking integral equations used to express
it, but ultimately, if you can count, you can implement this.

Let's build a bayesian spam engine:

Start with a known-good email (not-spam) and a solicitation (spam). For each
word in the not-spam document, count the number of times each word appears.
That's your not-spam "corpus". Repeat for the known-spam document. Now you
have a list of words+frequency (dictionary of counts). There will be many
words which appear in both lists. This is normal.

Now, when you have an unclassified document, for each word in the document
check both lists (spam & not-spam). In which list does this word appear more
frequently? If the count for that word is higher in the spam list, then this
is a "spam" word, therefore this document is +1 spam. If a not-spam word, then
this document is +1 not-spam. Repeat for all words. At the end, which count
(spam or not-spam) has the higher total? There's your answer. Spam or not-
spam. Done.

Simple counting and comparison.

Seriously, that's the algorithm. It's that simple.
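
A minimal sketch of exactly that counting scheme in C, with a made-up
five-word vocabulary and made-up counts standing in for a real
dictionary built from your mail:

    #include <stdio.h>
    #include <string.h>
    
    #define VOCAB 5
    static const char *vocab[VOCAB] = { "free", "viagra", "meeting", "lunch", "offer" };
    static int spam_count[VOCAB]    = { 9, 4, 0, 1, 7 };  /* counts from spam docs */
    static int ham_count[VOCAB]     = { 1, 0, 6, 5, 1 };  /* counts from good docs */
    
    /* +1 vote to whichever side uses the word more; the bigger total wins */
    static const char *classify(const char **words, int n) {
        int spam_votes = 0, ham_votes = 0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < VOCAB; j++)
                if (strcmp(words[i], vocab[j]) == 0) {
                    if (spam_count[j] > ham_count[j]) spam_votes++;
                    else if (ham_count[j] > spam_count[j]) ham_votes++;
                }
        return spam_votes > ham_votes ? "spam" : "not-spam";
    }
    
    int main(void) {
        const char *doc[] = { "free", "offer", "meeting" };
        printf("%s\n", classify(doc, 3));   /* "spam": 2 spam votes vs 1 */
        return 0;
    }

Swap the fixed arrays for a hash map plus a tokenizer and you've got
the whole engine.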

If the system miscategorizes a document, move it to the appropriate group
yourself and count all the words again. (retrain). The system just "learned"
from the mistake. It got "smarter".

You can easily add more categories other than "spam" and "not spam" --
"questions from customers", "family & friends", etc. It works the same. The
group with the highest total "wins" the classification.

Extra credit: Now here's the interesting part. The corpus doesn't have to be
words. It could be a switch set to "on" or "off." Joint angles on a robot.
Ping time of a packet. Temperature from a thermistor.

And the classification doesn't need to be spam/not-spam, it could be "friend
vs foe", upright/inverted, safe/not-safe, or anything else. The algorithm
doesn't care, it's looking for probability of being in a classification group.

There you go. Have fun!

~~~
baron_harkonnen
> Seriously, that's the algorithm. It's that simple.

It's funny, because when I saw that line I thought I would be pedantic
and double-check whether you got the Laplacian smoothing correct
(because in practice it's never "that simple" for any numeric
algorithm), and then realized that you don't understand how to
implement Naive Bayes at all.

For starters, you aren't using probability. If you want to put all of
the words in each document together into two dictionaries of counts,
then for each word in the unclassified document you want to look at
the product of the probabilities of those words appearing in the spam
corpus vs. the non-spam corpus. That probability is
n_word/total_words in the corpus.

This is where you need to do some smoothing, because if a word does
not appear in one of the corpora you will get a probability of 0 for
that class. Smoothing just adds 1 to the numerator and the vocabulary
size to the denominator. It is the equivalent of assuming a weakly
informative uniform prior.
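
To make that concrete, a self-contained sketch of the smoothed version
(toy vocabulary and counts made up; equal class priors assumed and
therefore dropped). Log-probabilities are summed instead of
multiplying raw probabilities, so a long document doesn't underflow:

    #include <math.h>
    #include <stdio.h>
    #include <string.h>
    
    #define VOCAB 5
    static const char *vocab[VOCAB] = { "free", "viagra", "meeting", "lunch", "offer" };
    static int spam_count[VOCAB]    = { 9, 4, 0, 1, 7 };  /* word counts, spam corpus */
    static int ham_count[VOCAB]     = { 1, 0, 6, 5, 1 };  /* word counts, ham corpus  */
    
    /* Laplace smoothing: +1 in the numerator, +VOCAB in the denominator,
       so a word unseen in one corpus no longer zeroes out the whole product */
    static double log_prob(const int counts[], int total, int j) {
        return log((counts[j] + 1.0) / (total + (double)VOCAB));
    }
    
    static const char *classify(const char **words, int n) {
        int spam_total = 0, ham_total = 0;
        for (int j = 0; j < VOCAB; j++) {
            spam_total += spam_count[j];
            ham_total  += ham_count[j];
        }
        double spam_score = 0.0, ham_score = 0.0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < VOCAB; j++)
                if (strcmp(words[i], vocab[j]) == 0) {
                    spam_score += log_prob(spam_count, spam_total, j);
                    ham_score  += log_prob(ham_count,  ham_total,  j);
                }
        return spam_score > ham_score ? "spam" : "not-spam";
    }
    
    int main(void) {
        const char *doc[] = { "free", "meeting", "lunch" };
        printf("%s\n", classify(doc, 3));  /* "not-spam": the ham evidence wins */
        return 0;
    }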

~~~
pokernaming
> That probability is n_word/total_words in the corpus.

In this case, wouldn't it actually be better to just drop the
denominator, because it will be the same for both (spam & not spam)?

------
drderidder
This simple intro to machine learning with node.js has been one of the more
popular talks I've given:
[http://73rhodes.github.io/talks/MachineLearning/#/](http://73rhodes.github.io/talks/MachineLearning/#/)

~~~
LegitShady
This would be way better in a format where I didn't have to click over and
over. One page with the text going down, even.

Yours was a decent example. I've seen a really good one that used
weather conditions on a golf course, and people's inclination to play,
to predict how busy you'd be based on the weather.

~~~
drderidder
They're slides intended for a live presentation. You can use space bar or
arrow keys to flip through as well. We've found this format to work well for
our 2000+ member meetup group, but I might follow your suggestion to write it
out long-form sometime. The more technical documentation is at:
[https://github.com/73rhodes/dclassify](https://github.com/73rhodes/dclassify)

------
abhgh
I wrote a starter article quite a while ago about using Naive Bayes for
analyzing clickstream data if anyone is interested:
[https://tinyurl.com/ulmam85](https://tinyurl.com/ulmam85)

Recommend downloading the PDF for better rendering of math symbols.

------
usmannk
Huh, the embedded images all 403 for me but work when accessed directly.

Edit: Now it's like... 50/50?

~~~
ReDeiPirati
Pretty strange... are you browsing via a desktop/laptop or a mobile device?

------
RocketSyntax
How is this a top-voted post? Bayes is 101. Someone posted a world-class
pandas UI today and it's on page 3.


