

Ask YC: Existing Crawlers in Python - groovyone

Hi there. Hope you don't mind me posting. We're trying to create a system for analyzing and classifying web pages. We've done the classification using CRM114 (thanks for the link we were passed previously), but we're now looking to build a reliable, fast, and robust crawler. We've put together something basic with Twisted, but the question has been raised: what is already out there that has been well tested, doesn't overload sites, conforms to robots.txt, and can work across multiple servers? We've looked at Pyro for the multiple-server element (which looks fine), but we're struggling a little. I thought I'd ask here if anyone has pointers to a great, compact Python crawler that we could use?

Thanks in advance

Neil
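For context, the sort of politeness logic we're after is roughly this (just a sketch; `PoliteFetcher` and `POLITE_DELAY` are placeholder names, and the robots rules are passed in directly so the example stays offline):

```python
import time
import urllib.robotparser
from urllib.parse import urlparse

POLITE_DELAY = 1.0  # assumed minimum seconds between hits to one host

class PoliteFetcher:
    def __init__(self, user_agent="MyCrawler"):
        self.user_agent = user_agent
        self.robots = {}    # host -> RobotFileParser
        self.last_hit = {}  # host -> timestamp of last request

    def _parser_for(self, host, robots_lines):
        # A real crawler would fetch http://<host>/robots.txt; here the
        # rules arrive as a list of lines so the sketch needs no network.
        rp = urllib.robotparser.RobotFileParser()
        rp.parse(robots_lines)
        self.robots[host] = rp
        return rp

    def allowed(self, url, robots_lines=()):
        # Check the cached robots.txt rules for this host.
        host = urlparse(url).netloc
        rp = self.robots.get(host) or self._parser_for(host, robots_lines)
        return rp.can_fetch(self.user_agent, url)

    def wait_turn(self, url):
        # Sleep if we hit this host less than POLITE_DELAY seconds ago.
        host = urlparse(url).netloc
        elapsed = time.time() - self.last_hit.get(host, 0)
        if elapsed < POLITE_DELAY:
            time.sleep(POLITE_DELAY - elapsed)
        self.last_hit[host] = time.time()
```

This is more or less what we'd want a ready-made crawler to handle for us, plus distribution across servers.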
======
yan
Does it have to be Python? I'm sure you can use any webcrawler to actually
crawl, and use Python to analyze the results.

Nutch (<http://lucene.apache.org/nutch/>) is a project to create a search
engine, with a big crawling component. You can also find a list of crawlers
here: <http://en.wikipedia.org/wiki/Web_crawler#Open-source_crawlers>

~~~
dshah
+1 for Nutch. We're using it at my startup HubSpot and it has worked well for
us.

------
rams
Harvestman is written in Python.

<http://www.harvestmanontheweb.com/>
<http://harvestman.everythingability.com/>

------
yourabi
I would also take a look at Heritrix (<http://crawler.archive.org/>) -- it's
what powers the Wayback Machine.

~~~
gojomo
Thanks for the plug!

As a developer of Heritrix, I can't honestly say it's compact or Python, but
it is well-behaved, highly customizable (both by settings and by many Java
extension points), and capable of high-volume crawling for many purposes.

You could also embed Python code via Jython with a little work, if necessary.

------
pragmatic
wget. I integrated it with a C# application just fine. It outputs the pages to
files and produces a nice, parsable crawl log. It's single-threaded, but unless
you're crawling Wikipedia, you won't have a problem. Small tools that work
well are a good start. I had problems with many of the multi-threaded
crawlers; they seemed to trip over themselves. wget was fast and rock solid.
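To stay in Python you can drive wget as a subprocess; something like this (the flag choices are one reasonable set, not my exact command line):

```python
import subprocess

def build_wget_cmd(url, out_dir="pages", log_file="crawl.log"):
    # Assemble a polite recursive wget invocation.
    return [
        "wget",
        "--recursive",
        "--level=2",                      # limit crawl depth
        "--wait=1",                       # pause between requests
        "-e", "robots=on",                # honor robots.txt (wget's default)
        "--directory-prefix=" + out_dir,  # where fetched pages land
        "--output-file=" + log_file,      # parsable crawl log
        url,
    ]

def crawl(url):
    # Run wget; it writes pages to disk and the log to log_file.
    return subprocess.run(build_wget_cmd(url)).returncode
```

Then parse the log and the saved files afterward with whatever analysis code you like.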

------
groovyone
Thanks for these. They both look 'high end' rather than small and
customizable, but I'll check them out. Thanks for the tips and links.

