

Helpful Resources on Web Crawling? - jfornear

Lately, I have been trying to think of a project to work on that would occupy my time while I'm bored in the dorm. I don't really know what it will be yet, but I want it to involve a web crawler because those are really interesting to me for some reason. For example, The Hype Machine (http://hypem.com) has a crawler that pulls MP3 links from a bunch of music blogs. That is awesome! How does that work?!

I found some open source crawlers, but have no idea where to start. I've read some research papers on them as well, but they seem outdated. I'm looking for resources that would help me get started with the most modern approach--if that makes sense.

I would really appreciate any helpful resources you might have to share. Thanks.

I am most familiar with PHP, but I am open-minded.
======
cnu
Check out the Nutch crawler. It closely integrates with the Lucene search
engine project.

If you want something very specific (like The Hype Machine), you can
easily create a basic crawler in Python. Crawling strategies have not
changed much. Basic programming knowledge and a read through the
documentation of the urllib module (in Python) would be enough.
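
To make the urllib suggestion concrete, here is a minimal sketch of the link-extraction half of such a crawler, using only the Python 3 standard library (`urllib.request` is where Python 3 keeps the URL-fetching part of urllib). The names `LinkCollector`, `extract_links`, `mp3_links`, and `fetch` are made up for illustration, and matching on a `.mp3` suffix is only a guess at how a Hype Machine-style crawler might pick out audio links, not a description of how that site actually works.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links into absolute ones.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all anchor targets found in an HTML string, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def mp3_links(html, base_url):
    """Keep only links whose URL ends in .mp3 (an assumed, naive filter)."""
    return [u for u in extract_links(html, base_url) if u.lower().endswith(".mp3")]


def fetch(url):
    """Download a page and decode it to text; performs a real network request."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A crawl step would then look something like `mp3_links(fetch(page_url), page_url)`, with the non-MP3 links from `extract_links` queued as the next pages to visit.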

~~~
jfornear
I came across Nutch, actually. I just didn't know if it was recommended. Thanks,
I'll look into urllib and find some Python tutorials.

------
gtani
Read the O'Reilly book. On your sources: read the ToS and robots.txt, and don't
pound on anybody's servers, even if you're within the ToS.
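
The robots.txt and "don't pound" advice can be sketched with the standard library's `urllib.robotparser`. This is only an illustration under assumptions: the `PoliteGate` class, the `dormbot` user agent, and the one-second default delay are invented here, and a real crawler would download each site's robots.txt rather than parse a string.

```python
import time
from urllib.robotparser import RobotFileParser


def build_robots(robots_txt):
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class PoliteGate:
    """Checks robots.txt rules and enforces a minimum gap between requests."""

    def __init__(self, robots, user_agent, min_delay=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.robots = robots
        self.user_agent = user_agent
        self.min_delay = min_delay  # seconds between requests (assumed value)
        self.clock = clock          # injectable so the delay can be tested
        self.sleep = sleep
        self.last = None            # time of the previous request, if any

    def allowed(self, url):
        """True if robots.txt permits this user agent to fetch the URL."""
        return self.robots.can_fetch(self.user_agent, url)

    def wait(self):
        """Sleep just long enough to keep min_delay between requests."""
        if self.last is not None:
            remaining = self.min_delay - (self.clock() - self.last)
            if remaining > 0:
                self.sleep(remaining)
        self.last = self.clock()
```

In a crawl loop, each URL would be skipped unless `gate.allowed(url)` is true, and `gate.wait()` would be called right before every request to the same host.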

<http://news.ycombinator.com/item?id=158902>

There's a bunch of Python scrapers in the Python Package Index:

[http://pypi.python.org/pypi?%3Aaction=search&term=crawle...](http://pypi.python.org/pypi?%3Aaction=search&term=crawler&submit=search)

Until you're sure your IP won't be blacklisted, test your crawler from your
friends' houses. I've heard that many cablecos/telcos regard a request for a
new IP as very suspicious!

------
pierrefar
We had a recent discussion about this:

<http://news.ycombinator.com/item?id=150077>

------
adatta02
Solr might also be worth a look: <http://lucene.apache.org/solr/>. It's also
based on Lucene, but has JSON/XML connectors to alleviate some of the pain of
using Nutch.

~~~
cnu
Solr is a search engine (based on Lucene, which is itself a search engine). It
won't help the OP, as he wants to crawl websites. Maybe if he wants to search
through the crawled data later, he can use it. I would recommend Solr instead
of Lucene, as it has faceted search and updates to documents, which (I think)
were missing in Lucene.

