

The Perils of Web Crawling - necrodome
http://streamhacker.com/2010/10/04/perils-web-crawling/

======
trustfundbaby
Writing a web crawler 5-6 years ago (in PHP no less) was what really turned me
into a programmer ... Up to that point I had just been an HTML/CSS dude who
would hack at PHP as needed, but kept trying to do more and more with it.

I learned _so_ much about the interwebs ... http, urls (their construction),
html markup and why things work the way they do ... learned about threading,
using queues, and finally ... really grokked OOP.

Best of all I gained a newfound respect and understanding of Googlebot and web
browsers in general ... dealing with people's crazy-ass HTML code is not
easy.

If I ever teach a class on programming ... it's something I'd love to have my
students attempt as a semester long (background) project.

Good times.

------
kakaylor
It is worth emphasizing that what the author is talking about is more akin to
screen scraping than web crawling. Both tasks have their challenges, but
screen scraping has several that are inherently difficult to overcome.

In particular, with screen scraping, you are trying to extract structured data
from a markup language (in this case HTML) that simply doesn't guarantee the
structure you're looking for. With web crawling you only need the structural
guarantees offered by the HTML markup (not even that, with the quality of
libraries such as TagSoup or Neko).
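
To make the crawling side concrete, here is a minimal sketch of pulling links out of sloppy markup. TagSoup and Neko are Java libraries; this swaps in Python's standard-library html.parser as a stand-in, and the names LinkExtractor and extract_links are made up for illustration. It assumes all a crawler needs from a page is its anchor hrefs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href=...> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Unclosed tags, an unquoted attribute and mixed case do not stop the parser.
messy = '<p><a href="/about">About<b><a HREF=contact.html>Contact</p>'
print(extract_links(messy, "http://example.com/dir/page.html"))
```

The parser never builds a tree, so it does not care whether tags nest or close properly, which is about all a crawler needs. Scraping specific fields out of the same page gets no such help.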

Now, that isn't to say web crawling doesn't have its own challenges (URL
canonicalization anyone?).
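
For anyone who hasn't hit it, a rough sketch of what URL canonicalization involves, using only Python's urllib.parse. The canonicalize function and its rules are assumptions for illustration, not a standard: it lowercases the scheme and host, drops default ports and fragments, and sorts query parameters, which is only safe on sites where parameter order doesn't matter. It also drops userinfo and doesn't handle IPv6 hosts.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url):
    """Return a normalized form of url (hypothetical rules, see above)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the scheme's default.
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    # Treat an empty path as the root path.
    path = parts.path or "/"
    # Assumption: parameter order is irrelevant, so sort for a stable form.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # The fragment is never sent to the server, so drop it.
    return urlunsplit((scheme, netloc, path, query, ""))


print(canonicalize("HTTP://Example.COM:80/a?b=2&a=1#frag"))
print(canonicalize("http://example.com"))
```

Even this leaves out the hard cases: trailing slashes, session IDs in query strings, www vs. bare domains, percent-encoding differences. Those need per-site judgment, which is where the pain comes from.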

------
shib71
If the conclusion is to not write a web crawler, the next step is a service
like <http://www.80legs.com/>.

------
redstripe
"Sites will ban you"

Wish it would happen here on HN. Seems like it's a weekly occurrence where
someone announces a pet project that involves crawling the whole site. I can't
help but think this is connected to the 30+ second page loads I get here
often.

