Ask YC: How would you do keyword classification
15 points by groovyone on April 13, 2008 | 20 comments
I'm trying to automate the classification of documents coming in from our crawler. I've searched Google for keyword classification in Python, but I'm not getting any joy. Is anyone here automatically tagging or classifying documents (maybe after some training)? If so, it would be good to know how.



i think you are looking for document classification algorithms: http://en.wikipedia.org/wiki/Document_classification

the current state-of-the-art algorithms are based on support vector machines, but their training step can be tricky to implement in a scalable fashion. if you are looking for a quick and dirty approach, classifying on TF-IDF term weights (a sort of naive "naive Bayes" :) is simple and adequate for many applications
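
for what it's worth, here is a minimal sketch of the quick and dirty route, assuming a handful of labeled example documents per category. it sums the TF-IDF vectors of each category's documents into a centroid and picks the centroid closest (by cosine similarity) to a new document. the names `tokenize`, `train`, `classify` and the toy `TRAINING` data are made up for illustration; this is one simple way to classify on TF-IDF weights, not a reference implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_docs):
    """Build IDF weights and one summed TF-IDF vector (centroid) per label."""
    tokenized = [(tokenize(text), label) for text, label in labeled_docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens, _ in tokenized:
        df.update(set(tokens))
    # Smoothed inverse document frequency (rare terms get larger weights).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    centroids = {}
    for tokens, label in tokenized:
        centroid = centroids.setdefault(label, Counter())
        for term, tf in Counter(tokens).items():
            centroid[term] += tf * idf[term]
    return idf, centroids


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, idf, centroids):
    """Return the label whose centroid is most similar to the document."""
    counts = Counter(tokenize(text))
    # Terms never seen during training carry no weight.
    vec = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Toy training data, purely illustrative.
TRAINING = [
    ("the striker scored a goal in the football match", "sports"),
    ("the team won the league match", "sports"),
    ("the parliament passed the election law", "politics"),
    ("the minister lost the election vote", "politics"),
]
idf, centroids = train(TRAINING)
print(classify("a late goal decided the match", idf, centroids))
```

with real crawler data you would want far more examples per category and some stop-word removal before this becomes useful.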


forgot to mention that you may want to look at the Orange framework which is in Python http://www.ailab.si/orange/


great! that does look interesting. I'll have a good look through it. You involved in this field yourself? If so, drop me a line as we're looking for a consultant to help us with something


I've done some work on machine learning and specifically document classification in a corporate behemoth. Send me an email to gmail.com and prefix that with osipov followed by an @ sign -- I'll get back to you.


This sounds interesting to me too, and in an area where I've had some experience. If nothing else perhaps we can trade notes. You'll find contact details in my profile.


I think every language has an open source naive Bayes/Bayes network implementation. And most of the time they are good enough.
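
To give an idea of how little is involved, here is a minimal sketch of a multinomial naive Bayes text classifier with add-one (Laplace) smoothing, using only the standard library. The class name `NaiveBayes` and the toy spam/ham data are assumptions for illustration; an existing open source implementation will be better tested than this.

```python
import math
import re
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes over word counts, with add-one smoothing."""

    def __init__(self):
        self.doc_counts = Counter()               # label -> number of documents
        self.word_counts = defaultdict(Counter)   # label -> word -> count
        self.vocab = set()

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z]+", text.lower())

    def train(self, text, label):
        self.doc_counts[label] += 1
        for word in self.tokenize(text):
            self.word_counts[label][word] += 1
            self.vocab.add(word)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        words = [w for w in self.tokenize(text) if w in self.vocab]
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.doc_counts.items():
            # Work in log space to avoid underflow on long documents.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data, purely illustrative.
nb = NaiveBayes()
nb.train("buy cheap pills now", "spam")
nb.train("cheap watches buy now", "spam")
nb.train("meeting agenda for monday", "ham")
nb.train("lunch on monday with the team", "ham")
print(nb.classify("cheap pills"))
```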

SVMs (support vector machines) are so far considered the best classification algorithm.


>SVMs (support vector machines) are so far considered the best classification algorithm.

In the spirit of accuracy, SVM algorithms aren't _the best_. The best algorithms are ensemble-based, incorporating SVM and alternatives.
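
As a minimal sketch of the ensemble idea, assuming each member is just a function from a document to a label, the simplest combiner is a majority vote. The three keyword rules below are hypothetical stand-ins for real trained models (an SVM, a naive Bayes classifier, and so on), and real ensembles often weight or stack their members rather than counting votes.

```python
from collections import Counter


def majority_vote(classifiers, document):
    """Return the label predicted by the largest number of classifiers."""
    votes = Counter(clf(document) for clf in classifiers)
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for trained models.
def rule_goal(doc):
    return "sports" if "goal" in doc.lower() else "politics"


def rule_match(doc):
    return "sports" if "match" in doc.lower() else "politics"


def rule_vote(doc):
    return "politics" if "vote" in doc.lower() else "sports"


ensemble = [rule_goal, rule_match, rule_vote]
print(majority_vote(ensemble, "The vote on the match schedule"))
```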


LSI (Latent Semantic Indexing, also known as Latent Semantic Analysis) is pretty useful in this field. Python has an amazing toolkit for it:

http://nltk.org/index.php

Email me if you have any questions. I've been playing with this stuff for a while and it's really interesting.


I'm playing with some code for this just for fun. There are two ways to classify a document:

1. You already have a set of keywords (categories) that the document can belong to. The objective is to match a document to its category. This is true classification.

2. You want to extract relevant keywords from a document. This is not classification in the true sense but keyword extraction.

Each one has different approaches to achieve it, but they are similar problems. (As an example of similarity: you have a set of categories each defined by a tag cloud. You extract keywords from a document and see which tag cloud it matches best.)
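
A minimal sketch of that tag cloud example, assuming each category is simply a set of tags and that "keywords" are the most frequent non-stop-words in the document. The names `extract_keywords` and `best_category`, the tiny stop-word list, and the sample tag clouds are all made up for illustration; a real extractor would be considerably smarter.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "in", "is", "it", "of", "on", "the", "this", "to"}


def extract_keywords(text, top_n=5):
    """Return the top_n most frequent non-stop-words as a set."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    return {word for word, _ in Counter(words).most_common(top_n)}


def jaccard(a, b):
    """Overlap between two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_category(text, tag_clouds, top_n=5):
    """Pick the category whose tag cloud overlaps most with the keywords."""
    keywords = extract_keywords(text, top_n)
    return max(tag_clouds, key=lambda name: jaccard(keywords, tag_clouds[name]))


# Hypothetical tag clouds.
tag_clouds = {
    "python": {"python", "code", "library", "script"},
    "cooking": {"recipe", "oven", "flour", "bake"},
}
print(best_category("Bake the flour mix in a hot oven; this recipe is easy", tag_clouds))
```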

So how do you do each one?

Classification: I'm not well versed in this area and I'm interested in learning - it's next on my to-do list.

Keyword Extraction: Yahoo! has an API to do that, but honestly, it's rubbish. I don't know how it "works" but it doesn't really. Open Calais is really good but has a noticeable error rate (I didn't quantify it but after trying many documents with it, I regularly noticed minor mistakes).

Hope this helps.


Check out our tool, http://tagger.flaptor.com

It's based on a Bayesian algorithm plus a bunch of other heuristics for fine-tuning. In our case, the classification algorithm is not nearly as important as how we select documents for the training set.

Also, take a look at this post that was mentioned here a few days ago:

More data usually beats better algorithms: http://anand.typepad.com/datawocky/2008/03/more-data-usual.h...


I believe the keywords here (heh) are Text Mining. You should look into it (I'm doing my final project in this field).

There are a lot of things to keep in mind when classifying documents by keywords: document preprocessing, keyword weighting, and much more.

Someone mentioned Latent Semantic Indexing. It's worth a look, but if you're just starting, you should look at the big picture first.


you may want to try, or match your results against:

http://opencalais.com/


Have a look at Yahoo's Content Analysis Term Extraction Web Service: http://developer.yahoo.com/search/content/V1/termExtraction....

And http://opencalais.com/


I use the Yahoo Term Extractor API. This probably doesn't meet your long term needs, but it's great for prototyping.

http://developer.yahoo.com/search/content/V1/termExtraction....


Just note that you are limited to 5k queries per 24 hours per IP address with this. Like Tony said, NOT for production.


When you refer to classification, are you really talking about clustering? Is it the same thing? I've been looking at document/article clustering for NewsCred, and so far have spent some time with Carrot. It's open source and pretty good...


clustering != classification

- classification (also known as categorization) - you have a set of categories to which you expect your documents to be assigned (the documents will be matched with the existing categories)

- clustering - you expect the algorithm to give you a set of categories (the ML algorithms will find similarities between the several documents and will group them accordingly)
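
To make the difference concrete, here is a minimal sketch that uses the same word-overlap similarity both ways. `classify` needs the categories up front, while `cluster` is given only documents and forms its own groups. The greedy single-pass grouping and the `threshold` value are assumptions chosen for brevity; this is not how Carrot or any particular library does it.

```python
import re


def words(text):
    """Represent a document as its set of lowercase words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def similarity(a, b):
    """Jaccard overlap between two word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(doc, categories):
    """Classification: the categories (name -> example text) already exist."""
    doc_words = words(doc)
    return max(categories, key=lambda name: similarity(doc_words, words(categories[name])))


def cluster(docs, threshold=0.2):
    """Clustering: no categories given; group similar documents greedily."""
    clusters = []  # each cluster is a list of documents
    for doc in docs:
        for group in clusters:
            # Compare against the first document of each existing group.
            if similarity(words(doc), words(group[0])) >= threshold:
                group.append(doc)
                break
        else:
            clusters.append([doc])  # nothing similar enough: start a new group
    return clusters


docs = [
    "football match goal",
    "goal in the football match",
    "election vote result",
    "vote in the election",
]
categories = {"sports": "football match goal team", "politics": "election vote parliament"}
print(classify("late goal wins football match", categories))
print(cluster(docs))
```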


Hi there. I've not heard of Carrot. Do you have a link for it? My Google search brought up some orange things you grow and eat :)


One simple thing to do would be to download CRM-114 and use it.


See here for more on CRM-114:

http://news.ycombinator.com/item?id=124085



