I'm trying to automate the classification of documents that our crawler is finding. I've done a Google search for keyword classification using Python, but I'm not getting any joy. Is anyone here automatically tagging or classifying documents (maybe after some training)? It would be good to know how, if you are.
The current state-of-the-art algorithms are based on support vector machines, but the learning step can be tricky to implement in a scalable fashion. If you're looking for a quick-and-dirty approach, TF-IDF (think of it as a naive "naive Bayes" :) is simple and adequate for many applications.
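For the quick-and-dirty route, here's a minimal TF-IDF sketch using only the Python standard library (the toy documents are invented for illustration, and there's no stemming or stopword removal):

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute TF-IDF weights for each term in each tokenized document."""
    n = len(docs)
    # df[term] = number of documents containing the term
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stock prices fell on monday".split(),
]
w = tfidf(docs)
# "cat" appears in two documents, so it scores lower than "mat",
# which is unique to the first document
assert w[0]["mat"] > w[0]["cat"]
```

For classification you'd then compare a new document's weight vector against per-category vectors (e.g. by cosine similarity) and pick the closest category.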
Great! That does look interesting; I'll have a good look through it. Are you involved in this field yourself? If so, drop me a line, as we're looking for a consultant to help us with something.
I've done some work on machine learning and specifically document classification in a corporate behemoth. Send me an email to gmail.com and prefix that with osipov followed by an @ sign -- I'll get back to you.
This sounds interesting to me too, and in an area where I've had some experience. If nothing else perhaps we can trade notes. You'll find contact details in my profile.
I'm playing with some code for this just for fun. There are two ways to classify a document:
1. You already have a set of keywords (categories) that the document can belong to. The objective is to match a document to its category. This is true classification.
2. You want to extract relevant keywords from a document. This is not classification in the true sense but keyword extraction.
Each one has different approaches to achieve it, but they are similar problems. (As an example of similarity: you have a set of categories each defined by a tag cloud. You extract keywords from a document and see which tag cloud it matches best.)
So how do you do each one?
Classification: I'm not well versed in this area and I'm interested in learning - it's next on my to-do list.
Keyword Extraction: Yahoo! has an API for this, but honestly, it's rubbish. I don't know how it "works", but it doesn't really. Open Calais is really good but has a noticeable error rate (I didn't quantify it, but after running many documents through it I regularly noticed minor mistakes).
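Since the web APIs are hit-and-miss, a crude local fallback is to take the most frequent non-stopword terms as keywords. A sketch (the tiny stopword list and sample text are my own invented examples, not what any of those services do):

```python
from collections import Counter

# Deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "on", "is", "it", "by"}

def extract_keywords(text, k=3):
    """Naive keyword extraction: top-k most frequent non-stopword terms."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    counts = Counter(t for t in tokens if t and t not in STOPWORDS)
    return [term for term, _ in counts.most_common(k)]

doc = ("The crawler fetches pages and the classifier tags pages "
       "by topic. Tags drive the crawler schedule.")
print(extract_keywords(doc))  # the three repeated content words win
```

Raw frequency over-rewards common words, which is exactly the problem TF-IDF weighting fixes; this is only a starting point.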
It's based on a Bayesian algorithm plus a bunch of other heuristics for fine-tuning. In our case, the classification algorithm is not nearly as important as how we select documents for the training set.
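The poster doesn't share their code, but a Bayesian text classifier along these lines can be sketched as multinomial naive Bayes with add-one smoothing, in plain Python (the training set here is toy data I made up):

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial naive Bayes with add-one (Laplace) smoothing."""
    def fit(self, docs, labels):
        self.term_counts = defaultdict(Counter)  # per-class term counts
        self.class_counts = Counter(labels)      # for class priors
        self.vocab = set()
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(doc)
            self.vocab.update(doc)
        return self

    def predict(self, doc):
        best, best_score = None, float("-inf")
        total_docs = sum(self.class_counts.values())
        v = len(self.vocab)
        for label, n_docs in self.class_counts.items():
            counts = self.term_counts[label]
            total_terms = sum(counts.values())
            # log prior + sum of log likelihoods (smoothed over the vocab)
            score = math.log(n_docs / total_docs)
            for term in doc:
                score += math.log((counts[term] + 1) / (total_terms + v))
            if score > best_score:
                best, best_score = label, score
        return best

train = [
    ("buy cheap pills now".split(), "spam"),
    ("cheap meds buy now".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("project status meeting".split(), "ham"),
]
clf = NaiveBayes().fit([d for d, _ in train], [l for _, l in train])
assert clf.predict("cheap pills".split()) == "spam"
assert clf.predict("monday meeting".split()) == "ham"
```

As the poster says, the training-set selection matters far more than this loop; with a skewed training set the priors and likelihoods will be skewed too.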
Also, take a look at this post that was mentioned here a few days ago:
I believe the keywords here (heh) are "text mining". You should look into it (I'm doing my final project in this field).
There are a lot of things to keep in mind when doing document classification with keywords: document preprocessing, keyword weighting, and much more.
Someone mentioned Latent Semantic Indexing; it's worth a look, but if you're just starting out, you should look at the big picture first.
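To make "document preprocessing" concrete: a typical pipeline lowercases, tokenizes, drops stopwords, and stems. A rough sketch, where the tiny stopword list and crude suffix-stripping are stand-ins for real resources (a full stopword list and a proper stemmer such as Porter's):

```python
import re

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}

def preprocess(text):
    """Lowercase, tokenize, drop stopwords, and crudely strip
    common suffixes (a placeholder for a real stemmer)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    out = []
    for t in tokens:
        if t in STOPWORDS:
            continue
        for suffix in ("ing", "ed", "es", "s"):
            # only strip if a reasonable stem (3+ letters) remains
            if t.endswith(suffix) and len(t) - len(suffix) >= 3:
                t = t[: -len(suffix)]
                break
        out.append(t)
    return out

print(preprocess("The crawlers indexed several documents"))
```

Whatever weighting scheme you use downstream (TF-IDF, LSI, etc.) operates on the tokens this step produces, so sloppy preprocessing caps the quality of everything after it.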
When you refer to classification, are you really talking about clustering? Is it the same thing? I've been looking at document/article clustering for NewsCred, and so far have spent some time with Carrot. It's open source and pretty good...
- classification (also known as categorization) - you have a predefined set of categories into which you expect your documents to be assigned (each document is matched against the existing categories)
- clustering - you expect the algorithm to discover a set of categories for you (the ML algorithms find similarities between the documents and group them accordingly)
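To illustrate the clustering side of that distinction, here's a minimal greedy approach that groups documents by Jaccard similarity of their token sets. The threshold and sample documents are arbitrary choices for illustration, and this is nothing like what Carrot actually implements:

```python
def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b)

def cluster(docs, threshold=0.2):
    """Greedy clustering: put each document in the first cluster whose
    seed document is similar enough, else start a new cluster."""
    clusters = []  # list of (seed token set, member indices)
    for i, doc in enumerate(docs):
        tokens = set(doc.split())
        for seed, members in clusters:
            if jaccard(tokens, seed) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((tokens, [i]))
    return [members for _, members in clusters]

docs = [
    "stock market prices fell today",
    "market prices rose in early trading",
    "the cat sat on the mat",
]
print(cluster(docs))  # the two market stories group together
```

Note that no category names exist beforehand: the groups emerge from the data, which is exactly what separates clustering from classification.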