

Ask/Poll News.YC: What is a good open source Bayes classifier? - fiaz

I'm just doing a little research for a project and I thought everybody could benefit from the fact that we happen to have an expert on the subject as a member.

I would also like to learn from what other people have used, so any input that others could give would be helpful.

------------

UPDATE: I started a poll thread below so that we can submit project links and up-vote a link if we have experience or knowledge of one project versus another.

Thank you to everybody who replied, you are great!!!
======
dreish
Whichever one you use, you'll probably need to edit the source to change the
probability formula from (n_hits / n_total) to ((n_hits + 1) / (n_total + 2)).
That's the correct formula based on an even distribution of probabilities
(which is close enough to the actual distribution in most situations for this
to be a huge improvement). I can never find a reference for this when I search
for one, but you can verify it experimentally with a short program in your
favorite dynamically-typed language, or a long program.

For example, suppose you live in a world with only black and white birds, but
you don't know the percentage of each and have no reason to believe 2% black
is any more likely than 70% black, or any other percentage. If you see two
black birds fly by, that doesn't mean the next bird you see has a 100%
probability of being black, but that's exactly the assumption most
widely-used naive Bayesian classifiers make.

I modified SpamProbe to use (spam+1)/(total+2), and the results have been
good.
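A minimal sketch (in Python, my own illustration rather than anything from this thread) of the experimental check described above: draw a bias uniformly at random, condition on seeing exactly k hits in n trials, and measure how often the next trial is a hit.

```python
# Experimental check of the (hits + 1) / (total + 2) rule: with a
# uniform prior over the unknown bias p, the chance that the next
# trial is a hit after k hits in n trials should be (k + 1) / (n + 2).
import random

def estimate_next_hit_prob(n, k, trials=200_000):
    """Empirical P(next trial is a hit | exactly k hits in n trials)."""
    next_hits = matches = 0
    for _ in range(trials):
        p = random.random()                       # uniform prior over the bias
        hits = sum(random.random() < p for _ in range(n))
        if hits != k:
            continue                              # condition on the observation
        matches += 1
        next_hits += random.random() < p          # outcome of the (n+1)-th trial
    return next_hits / matches

# Two black birds out of two: the naive estimate says 1.0, but the
# rule above predicts (2 + 1) / (2 + 2) = 0.75.
print(estimate_next_hit_prob(n=2, k=2))   # prints a value near 0.75, not 1.0
```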

~~~
jgrahamc
Sounds to me like you are describing Laplace smoothing in a naive Bayes
classifier. This is a pretty standard technique for avoiding the problem you
are seeing, where the probability comes out as 100%/0% because of a lack of
information in the model.

~~~
dreish
Thanks, yes, that's what it's called.

All due credit to Laplace for the technique, but the word "smoothing" is
making me wince, because it makes it sound as though this is some artificial
approximation. Under the assumption of an even distribution of probabilities,
(n+1)/(m+2) really _is_ the exact probability of the event repeating. Like I
said, you can confirm this experimentally with a quick program.
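For anyone who prefers the closed form to an experiment, here is the standard derivation (my own addition, not from the thread): with a uniform prior on the unknown probability p, the chance the next trial is a hit after n hits in m trials is the posterior predictive,

```latex
P(\text{hit}_{m+1} \mid n \text{ hits in } m \text{ trials})
  = \frac{\int_0^1 p \cdot p^{n}(1-p)^{m-n}\,dp}{\int_0^1 p^{n}(1-p)^{m-n}\,dp}
  = \frac{B(n+2,\; m-n+1)}{B(n+1,\; m-n+1)}
  = \frac{n+1}{m+2}.
```

With n = m = 2 (the two black birds), this gives 3/4 rather than 1.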

~~~
apgwoz
There are other smoothing techniques, more prevalent in NLP, that I'm
learning about in my NLP class; they distribute the probability mass more
evenly but really won't help (at least I don't think so) in a classifier.
Witten-Bell and Good-Turing are the ones I can think of off the top of my
head.

------
jgrahamc
What sort of classification are you trying to do? Text, I assume. If it's text
and you need something open source, or just a pointer to how to write one
yourself, you could read the article I wrote for Dr. Dobb's on this subject:
<http://www.ddj.com/development-tools/184406064>

There are quite a lot of toolkits out there that do Bayesian things (take a
look at libbow or Weka).

~~~
fiaz
I'm looking for something open source that I can start using right away with
minimal configuration... I'm not smart enough to create my own, but clever
enough to take somebody else's work and tailor it to what I'm trying to do.

It's only for text processing.

------
jackbard
<http://www.codeproject.com/KB/cs/BayesClassifier.aspx>

This is a simple one built in C#/.NET. I fixed a significant bug and raised
classification accuracy from 74% to 96% on my noisy dataset (automobile
accident claims). I emailed the author with a bunch of improvements (such as
histograms) and some other tweaks but never heard back. Anyway, the bugfix is
simple: take a look at category.cs. In TeachPhrase(), move m_TotalWords++
inside the "if (!m_Phrases.TryGetValue(phrase, out pc))" test.

What you want here is to count the number of unique words; the original code
was counting the total number of times all words appear. This one change
reduced classification errors by 3x.
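A Python analogue of the fix (the class and method names here are illustrative, not the actual C# source):

```python
# Sketch of the bugfix described above: the per-category word total
# should count *unique* phrases, so it is only incremented the first
# time a phrase is seen, not on every occurrence.
class Category:
    def __init__(self):
        self.phrases = {}       # phrase -> occurrence count
        self.total_words = 0    # number of unique phrases seen

    def teach_phrase(self, phrase):
        if phrase not in self.phrases:
            self.phrases[phrase] = 0
            self.total_words += 1   # buggy version incremented this on every call
        self.phrases[phrase] += 1

cat = Category()
for w in ["claim", "accident", "claim"]:
    cat.teach_phrase(w)
print(cat.total_words)   # prints 2: unique words, not the 3 total occurrences
```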

Cheers, --Jack

~~~
jackbard
Oh, forgot to mention that you'll want to compute probabilities using
logarithms to avoid underflow and precision loss (which will introduce
classification errors). Example: wordValue = System.Math.Log((double)count /
(double)cat.TotalWords);
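A quick Python illustration of why the log trick matters (my own sketch, not from the thread): multiplying many small per-word probabilities underflows double precision to 0.0, while summing their logs stays comfortably in range.

```python
# Underflow demo: 80 words each with probability 1e-5 give a product
# of 1e-400, far below the smallest representable double (~5e-324),
# so the naive product collapses to exactly 0.0. The log-space sum
# of the same scores is about -921, which is perfectly representable.
import math

word_probs = [1e-5] * 80

product = 1.0
for p in word_probs:
    product *= p                  # silently underflows to zero
print(product)                    # prints 0.0

log_score = sum(math.log(p) for p in word_probs)
print(log_score)                  # approx -921.03, no precision collapse
```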

------
food79
The Cadillac of Bayes classifiers is CRM114 -- it can use classifiers far
more advanced than naive Bayes, such as clustering or hidden Markov modeling.

~~~
ketralnis
I cannot over-recommend crm114. I've been using it to classify some database
entries, and its accuracy is second to none; its custom language makes
working with strangely-stored data (like database entries) easy (once you
learn the strange language).

~~~
fiaz
Is there any documentation that stands out for learning the alien language?

No doubt, I'll be combing through all of the CRM114 information on the
website. Is there anything that is not referenced there that will be of use?

~~~
food79
If you are doing a ham/spam type classification, then you won't need the alien
language. I am almost a total tech novice and I was able to do well with just
some bash scripts. Of course the docs will teach you about better ways to
train the system, if you are interested in going from 98% correct
classification to 99.5% correct.

learn ham.css < file_to_learn.txt

learn spam.css < file_to_learn.txt

classify < file_to_classify.txt

~~~
fiaz
I am NOT doing ham/spam type classification. I need to define some
classifications for specific types of content.

~~~
a-priori
Then substitute whatever those types of content are for ham/spam.

~~~
fiaz
Thanks for the affirmation! It helps when jumping into territory with which I
have no previous experience.

------
henning
The one with the simplest interface I've seen is
<http://divmod.org/trac/wiki/DivmodReverend> . Looks very easy to get started
with.

If you want to get more serious, use Weka or Bow or YALE or something
implemented in a reasonably fast language.

------
yawl
I use Weka in my project. Btw, there is a book about Weka:
<http://www.cs.waikato.ac.nz/~ml/weka/book.html>

~~~
fiaz
How does Weka differ from CRM114?

~~~
jorgeortiz85
I'm not familiar with CRM114, but upon cursory inspection it seems to be
written in C and targeted mainly at document classification.

Weka is written in Java and implements all kinds of machine learning/data
mining tools.

------
ashu
Andrew McCallum's Bow library (CMU) is a very robust toolkit.
<http://www.cs.cmu.edu/~mccallum/bow/>

Very high quality. Multiple people have used it for serious research.

------
vital_sol
If Perl is ok, you can try these:
[http://search.cpan.org/search?query=Bayes&mode=all](http://search.cpan.org/search?query=Bayes&mode=all)

------
simplegeek
I'm not sure about this one but you should definitely look into _Orange_ if
you're considering Python. <http://magix.fri.uni-lj.si/orange/>

------
toplakm
You could try Orange: <http://www.ailab.si/orange>

It is written in a combination of Python and C, and it can be used as a
Python module.

------
bcater
The one in BioPython is pretty good.

------
fiaz
PROJECT LINKS/POLL

 _reply only to this message and vote on links below_

~~~
fiaz
Ruby :: Classifier gem

<http://classifier.rubyforge.org/>

------
mig
Use CRF for classifying/labeling.

<http://crf.sourceforge.net/>

It's extremely well written, and it's easy to define features, etc.
Reasonably good support too.

------
herdrick
Write it yourself. I will say that teasing out the features you want to score
can be some work; does anyone here know if the CRM114 package helps with that?

------
pg
crm114

~~~
ivankirigin
awesome logo

<http://crm114.sourceforge.net/>

