

Ask HN: Is there a great open source crawler? - haihai

I want to crawl a site, like foxnews.com and find all the URLs that match a pattern.

A pattern like:

http://www.foxnews.com/\w+/\d+/\d+/\d+/.*/

That would find all URLs like:

http://www.foxnews.com/world/2010/06/02/report-natalee-holloway-suspect-sought-murder-peru/

I know I could do this myself. I also know it's a seemingly easy problem, that is actually quite hairy.

I'm hoping there's an open source crawler that I can point to a start page and say "Find all the URLs that match this pattern.".

I know there are dozens of crawlers out there. I just don't know if there's one or two that are really great. I'm really hoping there's a modern/simple/fast one that would be good for this purpose.

Does such a thing exist? If not, are there any great documents detailing the common problems and how to solve them?

Thank you.
======
ericwaller
Scrapy (<http://scrapy.org/>) matches your description pretty well. You can
specify which urls to crawl with regular expressions and then provide a bit of
code to do some data extraction.
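
For a feel of what "crawl with a regular expression" involves, here is a
minimal sketch using only the Python standard library. It is not Scrapy's API,
just an illustration of the idea. The names `LinkParser`, `extract_links`,
`fetch_page` and `crawl`, and the `max_pages` limit, are made up for this
example, and it assumes a single-host, breadth-first crawl with no robots.txt
handling, politeness delays, or retries (some of the "hairy" parts a real
crawler has to deal with).

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return absolute, fragment-free URLs for the links found in html."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, h)).url for h in parser.hrefs]


def fetch_page(url):
    """Download a page as text. Assumption: UTF-8, 10 second timeout."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(start_url, pattern, fetch=fetch_page, max_pages=100):
    """Breadth-first crawl of one host; return sorted URLs matching pattern."""
    wanted = re.compile(pattern)
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    matches = set()
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except (OSError, ValueError):
            continue  # skip pages that fail to download
        fetched += 1
        for link in extract_links(html, url):
            # Stay on the starting host and never queue a URL twice.
            if urlparse(link).netloc != host or link in seen:
                continue
            seen.add(link)
            queue.append(link)
            if wanted.fullmatch(link):
                matches.add(link)
    return sorted(matches)
```

Something like `crawl("http://www.foxnews.com/",
r"http://www\.foxnews\.com/\w+/\d+/\d+/\d+/.*/")` would then return the
matching URLs found within the first `max_pages` pages. Note the escaped dots
in the host name, which the pattern in the question leaves unescaped.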

------
yourabi
None of the ones I know are that simple, but check out Heritrix at
<http://crawler.archive.org> and Nutch at <http://nutch.apache.org>. Also worth
checking out is <http://80legs.com>.

------
Ledio
Nutch has a full-on web crawler with a lot of features, and it scales pretty
well. You can whitelist or blacklist URLs as you see fit, and filter out
unwanted content.
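
To illustrate the whitelist/blacklist idea in general terms, here is a small
Python sketch. It is not Nutch's configuration format or API. The function
name `is_allowed` and the rule that deny patterns take precedence are
assumptions made for this example.

```python
import re


def is_allowed(url, allow=(), deny=()):
    """Decide whether a URL should be crawled (hypothetical helper).

    Assumed policy: any matching deny pattern rejects the URL; if an allow
    list is given, the URL must also match at least one allow pattern.
    """
    if any(re.search(p, url) for p in deny):
        return False
    if allow:
        return any(re.search(p, url) for p in allow)
    return True


# Example rules: only article-style paths, and never image files.
allow = [r"/\w+/\d+/\d+/\d+/"]
deny = [r"\.(jpg|png|gif)$"]
```

A crawler would call `is_allowed(url, allow, deny)` before queueing each
discovered link.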

------
iworkforthem
There are quite a few open source web crawlers, in Java, PHP, etc. Based on
what you mentioned, Nutch is alright, but I'm not sure how you are going to
get all that unstructured information out and make it structured.

------
jdrock
Tooting my own horn, but this takes all of 1 minute in 80legs if you've got
the right regex.

~~~
haihai
I considered 80legs. The problem is that I don't want to do this for just one
site. I may have dozens. I can run this off my own servers/bandwidth (which I
already pay for) for much cheaper than using 80legs.

If 80legs were purely usage-based and cheaper, I probably would have used it.

~~~
jdrock
Understood. We do work with folks on custom per-use plans, but we just need to
figure out what your requirements are so that we can see if there's some way
for us to help.

