

Ask YC: How would you detect a page type? - johnnycage

Hi. If you were writing a script that searched for forums or detected blogs, etc., how would you do it? We're working on a system that crawls websites, and it would be great to be able to say "this is a blog", "this is a news site", or "this is a forum". Any suggestions on how to do it?
======
bct
You could get clever algorithmically, but I think it would be far simpler to
look for identifying traits of specific software packages. E.g. if you see
links to /wp-content/something, you're looking at a WordPress site and
therefore a blog.

Obviously that alone won't be able to classify a lot of sites, but I suspect
it will get the bulk of them.
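
A rough sketch of that idea in Python, in case it helps. Only the
/wp-content/ pattern comes from the suggestion above; the other patterns
(/wp-includes/ for WordPress, viewtopic.php for phpBB, showthread.php for
vBulletin), the FINGERPRINTS table, and the classify_page name are
assumptions added for illustration, and a real crawler would need a much
longer list.

```python
import re

# Hypothetical fingerprint table: URL fragments that tend to show up in
# HTML generated by well-known packages. Only /wp-content/ is from the
# thread; the rest are illustrative assumptions.
FINGERPRINTS = {
    "blog": [r"/wp-content/", r"/wp-includes/"],
    "forum": [r"viewtopic\.php", r"showthread\.php"],
}


def classify_page(html):
    """Return the first page type whose fingerprint appears in html, else None."""
    for page_type, patterns in FINGERPRINTS.items():
        for pattern in patterns:
            if re.search(pattern, html, re.IGNORECASE):
                return page_type
    return None  # no known package detected; needs another approach


if __name__ == "__main__":
    sample = '<link rel="stylesheet" href="/wp-content/themes/x/style.css">'
    print(classify_page(sample))
```

Anything that comes back as None would still need some other approach,
such as a text classifier.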

------
inovica
Not sure if this is what you want, but look at NLTK (the Natural Language
Toolkit) for Python.

