I'm trying to automate the classification of documents that our crawler is finding. I've done a Google search for keyword classification using Python, but I'm not getting any joy. Is anyone here automatically tagging or classifying documents (maybe after some training)? It would be good to know how, if you are.
The current state-of-the-art algorithms are based on support vector machines, but the learning step can be tricky to implement in a scalable fashion. If you're looking for a quick-and-dirty approach, TF-IDF (think of it as a naive "naive Bayes" :) is simple and adequate for many applications.
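For the quick-and-dirty route, here's a minimal TF-IDF sketch using only the Python standard library (the toy documents are invented for illustration, and there's no stemming or stopword removal):

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute TF-IDF weights for each term in each tokenized document."""
    n = len(docs)
    # df[term] = number of documents containing the term
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stock prices fell on monday".split(),
]
w = tfidf(docs)
# "cat" appears in two documents, so it scores lower than "mat",
# which is unique to the first document
assert w[0]["mat"] > w[0]["cat"]
```

For classification you'd then compare a new document's weight vector against per-category vectors (e.g. by cosine similarity) and pick the closest category.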
Great! That does look interesting; I'll have a good look through it. Are you involved in this field yourself? If so, drop me a line, as we're looking for a consultant to help us with something.
I've done some work on machine learning and specifically document classification in a corporate behemoth. Send me an email to gmail.com and prefix that with osipov followed by an @ sign -- I'll get back to you.
This sounds interesting to me too, and in an area where I've had some experience. If nothing else perhaps we can trade notes. You'll find contact details in my profile.
I'm playing with some code for this just for fun. There are two ways to classify a document:
1. You already have a set of keywords (categories) that the document can belong to. The objective is to match a document to its category. This is true classification.
2. You want to extract relevant keywords from a document. This is not classification in the true sense but keyword extraction.
Each one has different approaches to achieve it, but they are similar problems. (As an example of similarity: you have a set of categories each defined by a tag cloud. You extract keywords from a document and see which tag cloud it matches best.)
So how do you do each one?
Classification: I'm not well versed in this area and I'm interested in learning - it's next on my to-do list.
Keyword Extraction: Yahoo! has an API for this, but honestly, it's rubbish. I don't know how it "works", but it doesn't really. Open Calais is really good but has a noticeable error rate (I didn't quantify it, but after running many documents through it I regularly noticed minor mistakes).
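Since the web APIs are hit-and-miss, a crude local fallback is to take the most frequent non-stopword terms as keywords. A sketch (the tiny stopword list and sample text are my own invented examples, not what any of those services do):

```python
from collections import Counter

# Deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "on", "is", "it", "by"}

def extract_keywords(text, k=3):
    """Naive keyword extraction: top-k most frequent non-stopword terms."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    counts = Counter(t for t in tokens if t and t not in STOPWORDS)
    return [term for term, _ in counts.most_common(k)]

doc = ("The crawler fetches pages and the classifier tags pages "
       "by topic. Tags drive the crawler schedule.")
print(extract_keywords(doc))  # the three repeated content words win
```

Raw frequency over-rewards common words, which is exactly the problem TF-IDF weighting fixes; this is only a starting point.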
It's based on a Bayesian algorithm plus a bunch of other heuristics for fine-tuning. In our case, the classification algorithm is not nearly as important as how we select documents for the training set.
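The poster doesn't share their code, but a Bayesian text classifier along these lines can be sketched as multinomial naive Bayes with add-one smoothing, in plain Python (the training set here is toy data I made up):

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial naive Bayes with add-one (Laplace) smoothing."""
    def fit(self, docs, labels):
        self.term_counts = defaultdict(Counter)  # per-class term counts
        self.class_counts = Counter(labels)      # for class priors
        self.vocab = set()
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(doc)
            self.vocab.update(doc)
        return self

    def predict(self, doc):
        best, best_score = None, float("-inf")
        total_docs = sum(self.class_counts.values())
        v = len(self.vocab)
        for label, n_docs in self.class_counts.items():
            counts = self.term_counts[label]
            total_terms = sum(counts.values())
            # log prior + sum of log likelihoods (smoothed over the vocab)
            score = math.log(n_docs / total_docs)
            for term in doc:
                score += math.log((counts[term] + 1) / (total_terms + v))
            if score > best_score:
                best, best_score = label, score
        return best

train = [
    ("buy cheap pills now".split(), "spam"),
    ("cheap meds buy now".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("project status meeting".split(), "ham"),
]
clf = NaiveBayes().fit([d for d, _ in train], [l for _, l in train])
assert clf.predict("cheap pills".split()) == "spam"
assert clf.predict("monday meeting".split()) == "ham"
```

As the poster says, the training-set selection matters far more than this loop; with a skewed training set the priors and likelihoods will be skewed too.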
Also, take a look at this post that was mentioned here a few days ago:
I believe the keywords here (heh) are "text mining". You should look into it (I'm doing my final project in this field).
There are a lot of things to keep in mind when doing document classification with keywords: document preprocessing, keyword weighting, and much more.
Someone mentioned Latent Semantic Indexing; it's worth a look, but if you're just starting out, you should look at the big picture first.
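To make "document preprocessing" concrete: a typical pipeline lowercases, tokenizes, drops stopwords, and stems. A rough sketch, where the tiny stopword list and crude suffix-stripping are stand-ins for real resources (a full stopword list and a proper stemmer such as Porter's):

```python
import re

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}

def preprocess(text):
    """Lowercase, tokenize, drop stopwords, and crudely strip
    common suffixes (a placeholder for a real stemmer)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    out = []
    for t in tokens:
        if t in STOPWORDS:
            continue
        for suffix in ("ing", "ed", "es", "s"):
            # only strip if a reasonable stem (3+ letters) remains
            if t.endswith(suffix) and len(t) - len(suffix) >= 3:
                t = t[: -len(suffix)]
                break
        out.append(t)
    return out

print(preprocess("The crawlers indexed several documents"))
```

Whatever weighting scheme you use downstream (TF-IDF, LSI, etc.) operates on the tokens this step produces, so sloppy preprocessing caps the quality of everything after it.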
When you refer to classification, are you really talking about clustering? Is it the same thing? I've been looking at document/article clustering for NewsCred, and so far have spent some time with Carrot. It's open source and pretty good...
- classification (also known as categorization) - you have a predefined set of categories into which you expect your documents to be assigned (each document is matched against the existing categories)
- clustering - you expect the algorithm to discover a set of categories for you (the ML algorithms find similarities between the documents and group them accordingly)
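To illustrate the clustering side of that distinction, here's a minimal greedy approach that groups documents by Jaccard similarity of their token sets. The threshold and sample documents are arbitrary choices for illustration, and this is nothing like what Carrot actually implements:

```python
def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b)

def cluster(docs, threshold=0.2):
    """Greedy clustering: put each document in the first cluster whose
    seed document is similar enough, else start a new cluster."""
    clusters = []  # list of (seed token set, member indices)
    for i, doc in enumerate(docs):
        tokens = set(doc.split())
        for seed, members in clusters:
            if jaccard(tokens, seed) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((tokens, [i]))
    return [members for _, members in clusters]

docs = [
    "stock market prices fell today",
    "market prices rose in early trading",
    "the cat sat on the mat",
]
print(cluster(docs))  # the two market stories group together
```

Note that no category names exist beforehand: the groups emerge from the data, which is exactly what separates clustering from classification.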