

Ask HN: Libraries for classification of text - ColinWright

I've searched fairly hard, but I'm finding it hard to locate exactly what I want. Maybe it doesn't exist, and maybe I just need to adapt something that's there, or write my own. But I thought I'd ask.

I have a bunch of text items that have had classifications applied to them. These classifications aren't exclusive - each item may be in one or more classes. Perhaps it's best to think of them as tags.

What I want is a system that has a stab at applying these tags to previously unseen items, based on the tags of existing items. I know that such things exist, but they all seem to be poorly performing, proprietary, hard-to-use, free-but-closed, or otherwise just not quite what I want.

Can I take suggestions from the floor? Is this a problem you've dealt with, and what did you use?

TIA.
======
jgrahamc
You are just looking for a multi-label text classifier by the sounds of
things. You could start with Weka: <http://www.cs.waikato.ac.nz/ml/weka/> (see
<http://weka.wikispaces.com/Text+categorization+with+Weka>) or there's the CMU
toolkit called Rainbow (<http://www.cs.cmu.edu/~mccallum/bow/rainbow/>).

Or, if you want something simple, I wrote about how to create a naive Bayesian
text classifier in Dr. Dobb's some time ago:
<http://www.drdobbs.com/tools/184406064>

~~~
ColinWright
Thanks. And yes, I think I am, which is why I'm sure such things must not only
exist, but should probably be readily available and reasonably easy to use.

I have written a Bayesian classifier before, but that was binary, not multi-
label. I can run multiple binary Bayesian classifiers, but that's probably not
the best thing to be doing.

Hence the question.

I'll check your suggestions - thanks.

 _Added in edit: The package "bow" doesn't compile on my machine, so now I'm
looking at an undetermined amount of time fixing the package so it does
compile. Always fun._
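
For reference, the "multiple binary Bayesian classifiers" idea mentioned above (often called binary relevance) can be sketched in plain Python. This is only a minimal illustration, not anything from the packages named in this thread: the class names `BinaryNB` and `MultiLabelNB`, the tokenizer, the toy data, and the zero log-odds threshold are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class BinaryNB:
    """Multinomial naive Bayes for a single yes/no tag (hypothetical helper)."""

    def fit(self, docs, labels):
        self.counts = {True: Counter(), False: Counter()}
        self.n_docs = {True: 0, False: 0}
        for tokens, label in zip(docs, labels):
            self.counts[label].update(tokens)
            self.n_docs[label] += 1
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        return self

    def log_odds(self, tokens):
        """Return log P(tag | doc) - log P(not tag | doc), Laplace-smoothed."""
        total = self.n_docs[True] + self.n_docs[False]
        score = {}
        for label in (True, False):
            prior = (self.n_docs[label] + 1) / (total + 2)
            denom = sum(self.counts[label].values()) + len(self.vocab)
            s = math.log(prior)
            for tok in tokens:
                if tok in self.vocab:  # ignore words never seen in training
                    s += math.log((self.counts[label][tok] + 1) / denom)
            score[label] = s
        return score[True] - score[False]


class MultiLabelNB:
    """One independent BinaryNB per tag (binary relevance)."""

    def fit(self, texts, tag_sets):
        docs = [tokenize(t) for t in texts]
        all_tags = set().union(*tag_sets)
        self.models = {
            tag: BinaryNB().fit(docs, [tag in tags for tags in tag_sets])
            for tag in sorted(all_tags)
        }
        return self

    def predict(self, text, threshold=0.0):
        """Return every tag whose log-odds exceed the threshold."""
        tokens = tokenize(text)
        return {t for t, m in self.models.items() if m.log_odds(tokens) > threshold}


# Toy training data, invented purely for illustration.
texts = [
    "python code compiles fast",
    "python snake bites hiker",
    "rust code compiles slowly",
    "snake eats mouse",
    "python code models snake movement",
]
tag_sets = [
    {"programming"},
    {"animals"},
    {"programming"},
    {"animals"},
    {"programming", "animals"},
]
clf = MultiLabelNB().fit(texts, tag_sets)
print(clf.predict("python code snake"))
```

Each tag gets its own independent yes/no model, so an item can end up with zero, one, or several tags. The obvious weakness is that correlations between tags are ignored, which may be part of why it is "probably not the best thing to be doing".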

------
khandelwal
I've used NLTK (<http://www.nltk.org>) to do something similar. And there's
also <http://scikit-learn.org/stable/>. Both libraries are for Python.

As far as my experience with NLTK and machine learning in general goes: you'll
need to spend a fair bit of time figuring out which features to extract.
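
As a rough idea of what "figuring out which features to extract" can involve, here is a small plain-Python sketch of a bag-of-words extractor with optional bigrams. The function name `extract_features`, the tiny stop-word list, and the `use_bigrams` flag are assumptions made for illustration; they are not part of NLTK or scikit-learn.

```python
import re
from collections import Counter

# Illustrative stop-word list only; a real one would be much longer.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is", "it"})


def extract_features(text, use_bigrams=True):
    """Map a text to a Counter of unigram (and optionally bigram) counts."""
    tokens = [
        t for t in re.findall(r"[a-z0-9']+", text.lower())
        if t not in STOP_WORDS
    ]
    features = Counter(tokens)
    if use_bigrams:
        # Bigrams are built from the tokens that survive stop-word removal.
        features.update(f"{a}_{b}" for a, b in zip(tokens, tokens[1:]))
    return features


print(extract_features("The cat sat on the mat"))
```

Choices such as whether to drop stop words, keep numbers, or add bigrams are exactly the sort of thing that tends to need experimentation on the actual data.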

~~~
ColinWright
Thanks for those - I'll go check them out and see if I can bend them to my
will. Cheers!

