

Ask YC: How would you do keyword classification - groovyone

I'm trying to automate the classification of documents that our crawler is finding. I have done a Google search for keyword classification using Python, but I'm not getting any joy. Is anyone here automatically tagging or classifying documents (maybe after some training)? If you are, it would be good to know how.
======
osipov
i think you are looking for document classification algorithms:
<http://en.wikipedia.org/wiki/Document_classification>

the current state-of-the-art algorithms are based on support vector machines,
but their learning part can be tricky to implement in a scalable fashion. if
you are looking for a quick and dirty approach, the TF-IDF algorithm (it is a naive
"naive Bayes" :) is simple and adequate for many applications

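For anyone who wants to see the quick and dirty TF-IDF route in code, here is a minimal sketch using only the standard library. It is an assumption about one way to use TF-IDF for classification (weight terms, then pick the label of the most similar training document by cosine similarity), not a description of any particular library. The names `tokenize`, `tfidf_vectors`, `cosine` and `classify` are made up for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and keep purely alphabetic tokens; real pipelines do more.
    return [w for w in text.lower().split() if w.isalpha()]


def tfidf_vectors(docs):
    """Return (vectors, idf): one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed inverse document frequency (one common variant).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [
        {t: tf * idf[t] for t, tf in Counter(toks).items()}
        for toks in tokenized
    ]
    return vectors, idf


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, train_docs, train_labels):
    # Label of the training document most similar to `text`.
    vectors, idf = tfidf_vectors(train_docs)
    counts = Counter(tokenize(text))
    query = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    scores = [cosine(query, v) for v in vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return train_labels[best]
```

With a handful of labeled documents per category, `classify(new_text, docs, labels)` returns the label of the closest one. Recomputing the vectors on every call is wasteful and is done here only to keep the sketch short.
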
~~~
osipov
forgot to mention that you may want to look at the Orange framework which is
in Python <http://www.ailab.si/orange/>

~~~
groovyone
great! that does look interesting. I'll have a good look through it. You
involved in this field yourself? If so, drop me a line as we're looking for a
consultant to help us with something

~~~
osipov
I've done some work on machine learning and specifically document
classification in a corporate behemoth. Send me an email to gmail.com and
prefix that with osipov followed by an @ sign -- I'll get back to you.

------
omarish
LSI (Latent Semantic Indexing, also known as Latent Semantic Analysis) is
pretty useful in this field. Python has an amazing toolkit for it

<http://nltk.org/index.php>

Email me if you have any questions. I've been playing with this stuff for a
while and it's really interesting.

------
pierrefar
I'm playing with some code for this just for fun. There are two ways to
classify a document:

1\. You already have a set of keywords (categories) that the document can
belong to. The objective is to match a document to its category. This is true
classification.

2\. You want to extract relevant keywords from a document. This is not
classification in the true sense but keyword extraction.

Each one has different approaches to achieve it, but they are similar
problems. (As an example of similarity: you have a set of categories each
defined by a tag cloud. You extract keywords from a document and see which tag
cloud it matches best.)
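
The tag cloud matching idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: keywords are picked by raw frequency after dropping a tiny hand-written stop word list, and "matches best" is taken to mean the highest Jaccard overlap. The names `STOP_WORDS`, `extract_keywords` and `best_category` are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "for", "on", "with"}


def extract_keywords(text, k=5):
    # Most frequent non-stop-word tokens, a crude stand-in for real extraction.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return [w for w, _ in counts.most_common(k)]


def best_category(text, tag_clouds, k=5):
    # tag_clouds maps a category name to a set of tags.
    keywords = set(extract_keywords(text, k))

    def jaccard(tags):
        union = keywords | tags
        return len(keywords & tags) / len(union) if union else 0.0

    return max(tag_clouds, key=lambda name: jaccard(tag_clouds[name]))
```

Calling `best_category(text, {"cooking": {...}, "cycling": {...}})` returns the name of the category whose tag set overlaps most with the extracted keywords.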

So how do you do each one?

Classification: I'm not well versed in this area and I'm interested in
learning - it's next on my to-do list.

Keyword Extraction: Yahoo! has an API to do that, but honestly, it's rubbish.
I don't know how it "works" but it doesn't really. Open Calais is really good
but has a noticeable error rate (I didn't quantify it but after trying many
documents with it, I regularly noticed minor mistakes).

Hope this helps.

------
diego
Check out our tool, <http://tagger.flaptor.com>

It's based on a Bayesian algorithm plus a bunch of other heuristics for fine-
tuning. In our case, the classification algorithm is not nearly as important
as how we select documents for the training set.
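
To make "a Bayesian algorithm" concrete, here is a minimal multinomial naive Bayes sketch with add-one smoothing. It is not the tagger's actual implementation (that is not described here) and it includes none of the extra heuristics; the class name `NaiveBayes` and its methods are invented for the example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of training docs
        self.vocab = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.doc_counts.items():
            # Log prior plus the sum of log likelihoods for each word.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for w in words:
                # Add-one (Laplace) smoothing so unseen words do not zero out the score.
                count = self.word_counts[label][w] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Usage is `nb.train(text, label)` for each training document, then `nb.classify(new_text)`. As the comment says, which documents go into the training set will likely matter more than this code.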

Also, take a look at this post that was mentioned here a few days ago:

More data usually beats better algorithms:
<http://anand.typepad.com/datawocky/2008/03/more-data-usual.html>

------
presty
I believe the keywords here (heh) are Text Mining. You should look at it (I'm
doing my final project in this field).

There are a lot of things to keep in mind when trying to do document
classification with keywords: document preprocessing, keyword weighting and
much more.
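
As a rough idea of what the preprocessing step can look like, here is a small sketch: lowercase, tokenize, drop stop words, and strip a trailing "s" as a very crude stand-in for stemming. The stop word list and the name `preprocess` are assumptions made for the example; a real system would use a proper stemmer and a fuller list.

```python
import re

# Illustrative only; real stop word lists have hundreds of entries.
STOP_WORDS = frozenset({"the", "a", "an", "and", "or", "of", "to", "in", "is", "was", "it"})


def preprocess(text):
    # Lowercase and split into alphanumeric tokens.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Drop stop words and single-character tokens.
    tokens = [t for t in tokens if t not in STOP_WORDS and len(t) > 1]
    # Crude plural stripping: "documents" -> "document", but keep "class".
    return [
        t[:-1] if len(t) > 3 and t.endswith("s") and not t.endswith("ss") else t
        for t in tokens
    ]
```

The output of `preprocess` is what you would then feed into whatever weighting scheme you choose.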

Someone mentioned Latent Semantic Indexing; it's worth a look, but if you're
just starting, you should look at the big picture first.

------
siculars
you may want to try, or match your results against:

<http://opencalais.com/>

------
nreece
Have a look at Yahoo's Content Analysis Term Extraction Web Service:
<http://developer.yahoo.com/search/content/V1/termExtraction.html>

And <http://opencalais.com/>

------
tonystubblebine
I use the Yahoo Term Extractor API. This probably doesn't meet your long term
needs, but it's great for prototyping.

<http://developer.yahoo.com/search/content/V1/termExtraction.html>

~~~
bmatheny
Just note that you are limited to 5k queries per 24 hours per IP address with
this. Like Tony said, NOT for production.

------
shafqat
When you refer to classification, are you really talking about clustering? Is
it the same thing? I've been looking at document/article clustering for
NewsCred, and so far have spent some time with Carrot. It's open source and
pretty good...

~~~
presty
clustering != classification

\- classification (also known as categorization) - you have a set of
categories to which you expect your documents to be assigned (the
documents will be matched with the existing categories)

\- clustering - you expect the algorithm to give you a set of categories (the
ML algorithms will find similarities between the several documents and will
group them accordingly)
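
To show the difference in code: the sketch below clusters documents without being given any categories. It is a deliberately simple single-pass scheme (assign each document to the most similar existing cluster by Jaccard word overlap, or start a new one), chosen for brevity and not taken from Carrot or any other tool. The names `jaccard` and `cluster` and the `threshold` value are assumptions.

```python
def jaccard(a, b):
    # Overlap between two sets of words, from 0.0 to 1.0.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(docs, threshold=0.2):
    """Group document indices; no labels are supplied up front."""
    clusters = []  # each cluster: {"terms": set of words, "members": [indices]}
    for i, doc in enumerate(docs):
        terms = set(doc.lower().split())
        # Find the existing cluster that shares the most vocabulary.
        best = max(clusters, key=lambda c: jaccard(terms, c["terms"]), default=None)
        if best is not None and jaccard(terms, best["terms"]) >= threshold:
            best["members"].append(i)
            best["terms"] |= terms
        else:
            # Nothing similar enough: this document starts a new group.
            clusters.append({"terms": terms, "members": [i]})
    return [c["members"] for c in clusters]
```

A classifier would take the category names as input; here the groups simply fall out of the data, and naming them is left to you.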

------
jgrahamc
One simple thing to do would be to download CRM-114 and use it.

~~~
antiismist
See here for more on CRM-114:

<http://news.ycombinator.com/item?id=124085>

