

Ask YC: Blog parsing (WordPress,Typepad,Blogger) - samson

I'm trying to develop a crawler that knows when the page being sent to it is actually a post page, and not an index, search, tag, or calendar (e.g. November 2008) page.

I want pages like this -> http://1vibe.net/music/jim-jones-ft-lil-wayne-noe-twista-jackin-swagga-from-us/

Not this -> http://1vibe.net/category/behind-the-scenes/
Not this -> http://1vibe.net/2008/11/
Not this -> http://1vibe.net/tag/50-cent/

From the blog post page I want to grab the title and date of that post.

The way I was trying to do it was to look through the DOM of the site and look for consistency. I found consistency in Blogger and Typepad, but WordPress was all over the place in its formatting from site to site.

So I figure I must have been doing it wrong, and that the feeds (XML, RDF) are the intelligent way of doing it.

I'd appreciate it if anyone could help (also, I'm doing it in PHP).
======
raquo
If you are interested only in new posts, you can look in blogs' RSS feeds.
They are nearly always in default locations.

Or you could parse the URL. I had a similar task some time ago, and I went
with URLs: Blogger and Typepad are consistent; WordPress depends on the blog,
of course, but you could figure out the several most popular patterns (e.g.
/yyyy/mm/dd/posttitle, /id-posttitle) and get like 90% of all blogs right.
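
A rough sketch of the URL-pattern idea, in Python for brevity. The prefix list,
the regexes, and the name looks_like_post are all illustrative assumptions, not
a tested rule set; the /section/hyphenated-slug pattern is only a guess based
on the 1vibe.net example above, and real blogs will need more patterns.

```python
import re
from urllib.parse import urlparse

# First path segments that usually mark listing pages (assumed list).
LISTING_PREFIXES = ("category", "tag", "search", "page", "author", "feed")

# Date-only paths such as 2008, 2008/11 or 2008/11/05 are archive pages.
DATE_ARCHIVE = re.compile(r"^\d{4}(/\d{1,2}){0,2}$")

# A few common permalink shapes (illustrative, far from exhaustive).
POST_PATTERNS = [
    re.compile(r"^\d{4}/\d{1,2}/\d{1,2}/[^/]+$"),        # /yyyy/mm/dd/posttitle
    re.compile(r"^\d{4}/\d{1,2}/[^/]+$"),                # /yyyy/mm/posttitle
    re.compile(r"^\d+-[^/]+$"),                          # /id-posttitle
    re.compile(r"^[a-z0-9-]+/[a-z0-9]+(-[a-z0-9]+)+$"),  # /section/hyphenated-slug
]


def looks_like_post(url: str) -> bool:
    """Guess from the URL alone whether it points at a single post."""
    path = urlparse(url).path.strip("/").lower()
    if not path:
        return False  # the site root is the index page
    if path.split("/")[0] in LISTING_PREFIXES:
        return False  # category, tag, search and similar listings
    if DATE_ARCHIVE.match(path):
        return False  # checked before the post patterns on purpose
    return any(pattern.match(path) for pattern in POST_PATTERNS)
```

The archive check runs before the post patterns because a path like 2008/11/05
would otherwise match the /yyyy/mm/posttitle shape.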

Or maybe, just maybe, you could use some third parties that have already
figured it out via RSS - maybe Technorati?

~~~
samson
Yeah, I think that's the route I'll end up going. I've already started
developing a pattern system, and overnight I thought of a few ways that might
make it easier to get the title and date of a page.

There's only one thing I'm still stumped on, and that's simply how you tell
when you're on the original article page and not an index/tag/search page that
sometimes contains the same content as the article page.

------
Raphael
Just parse the URL. Or you can pull in the RSS feed, although that usually
only goes back 20 posts.
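
For the feed route, here is a minimal Python sketch that pulls the link, title,
and date out of an RSS 2.0 document using only the standard library. The sample
feed, its URLs, and the name posts_from_rss are made up for illustration; a
real crawler would fetch the blog's feed and would also need to handle Atom and
RDF feeds, which use different element names.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Made-up RSS 2.0 sample standing in for a fetched feed.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>First post</title>
      <link>http://blog.example.com/2008/11/05/first-post/</link>
      <pubDate>Wed, 05 Nov 2008 14:30:00 +0000</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <link>http://blog.example.com/2008/11/07/second-post/</link>
      <pubDate>Fri, 07 Nov 2008 09:00:00 +0000</pubDate>
    </item>
  </channel>
</rss>"""


def posts_from_rss(xml_text):
    """Return {permalink: (title, published datetime or None)} for each item."""
    root = ET.fromstring(xml_text)
    posts = {}
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if not link:
            continue  # skip items without a permalink
        title = (item.findtext("title") or "").strip()
        pub_date = item.findtext("pubDate")
        # RSS 2.0 dates use the RFC 822 format, which email.utils can parse.
        published = parsedate_to_datetime(pub_date.strip()) if pub_date else None
        posts[link] = (title, published)
    return posts
```

Any URL that shows up as an item link in the feed is a post page by
construction, so this can also serve as a check on URL guesses, at least for
the recent posts the feed still lists.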

