
Ask YC: Where to start with creating a distributed crawler - groovyone
Hi there. We're just starting out and want to create a crawler that will sit on EC2. Any advice appreciated. Here's what we're thinking of:

1. Using Beautiful Soup for the actual parsing of pages

2. We're not sure what to use for the crawl itself :( We use Python and love it, but don't know if we need to create our own crawler or what the best route would be. Any advice on this would be good

3. I'd like to create a distributed crawler where we can replicate the crawler over EC2 instances, but not sure how to do this

Apologies if I should ask this elsewhere. I love this community and have passively read many of the articles and comments on here for a couple of months now.

Any help or pointing in the right direction would be appreciated.

John
======
pierrefar
How you build it will depend on the service you host it on. So with AWS,
you'll use EC2, S3 and perhaps most importantly, SQS.

Fundamentally for a crawler, you will need the following:

1\. A list of URLs to crawl, perhaps even ranked in priority of crawling. This
is a database of sorts.

2\. A set of crawlers that figure out the most important URL on the list and
fetch it.

3\. A parser and HTML storage service. The parser will also feed new URLs into
the list.

Each of the above pieces is easy to do on its own. The trick is how you
glue them together. I would suggest something like the following as a starter
for using AWS for crawling:

1\. A MySQL list of URLs with some kind of priority ranking. This can be a
cluster of EC2 instances that store and prioritize the links. Early on, you
can ignore the prioritization aspects.

2\. The URL cluster dispatches SQS messages, one per URL to crawl, in the
desired crawling order.

3\. A cluster of EC2 instances checks the SQS queue for crawling messages and
fetches the URL specified in each message. While a message is being processed,
SQS hides it from the other crawlers, so they move on to the next one.
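
A minimal sketch of such a worker, assuming boto3, a pre-existing queue (the
queue URL below is a placeholder), and a message format of one URL per message
body:

```python
# Sketch of a crawler worker polling SQS for URLs to fetch.
# The queue URL and the one-URL-per-message format are assumptions.
import urllib.request
import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/crawl-queue"  # placeholder
sqs = boto3.client("sqs")

def store_page(url, html):
    print(url, len(html))  # stand-in for writing to S3 / a parser queue

def run_worker():
    while True:
        # Long-poll for one message; SQS hides it from other workers
        # until we delete it or the visibility timeout expires.
        resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                   MaxNumberOfMessages=1,
                                   WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            url = msg["Body"]
            try:
                html = urllib.request.urlopen(url, timeout=30).read()
                store_page(url, html)
            finally:
                # Delete the message so no other crawler re-fetches it
                # (a real crawler would re-queue failures instead).
                sqs.delete_message(QueueUrl=QUEUE_URL,
                                   ReceiptHandle=msg["ReceiptHandle"])

if __name__ == "__main__":
    run_worker()
```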

You can make the whole thing dynamic by adding crawling instances if the queue
gets too long. You can also have instances that determine each URL's crawling
priority for the next crawl (one metric is the number of backlinks to a page).
Another set of instances might be parsers or do the actual analysis of the
crawled pages.
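
A rough sketch of that "add instances when the queue gets too long" check,
again assuming boto3; the AMI ID, instance type, and threshold are
placeholders:

```python
# Sketch: launch another crawler instance when the SQS backlog grows.
import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/crawl-queue"  # placeholder
BACKLOG_THRESHOLD = 10000  # made-up threshold

sqs = boto3.client("sqs")
ec2 = boto3.client("ec2")

attrs = sqs.get_queue_attributes(
    QueueUrl=QUEUE_URL,
    AttributeNames=["ApproximateNumberOfMessages"])
backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])

if backlog > BACKLOG_THRESHOLD:
    # Start one more crawler worker from a pre-built crawler AMI.
    ec2.run_instances(ImageId="ami-0123456789abcdef0",  # placeholder AMI
                      InstanceType="t3.small",
                      MinCount=1, MaxCount=1)
```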

Which language to code it in? If you're going for maximal speed, perhaps you
should consider a compiled language. If not, Python or PHP or Perl would do
just fine. Personally I'd do it in a scripting language to begin with and
invest the time into a faster crawler later if warranted.

And good luck!

~~~
jwp
Creating the list of URLs and prioritizing them is the hardest thing about
building a crawler! That is, a good, web-scale one. A replacement for wget
might be sort of fun, but the real way to make a fast crawler is to be choosy:
track which pages get updated frequently, which are likely to contain good
content (by computing a PageRank-like stat on the fly), etc.
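
Purely as a toy illustration, a URL frontier can start out as a priority queue
keyed on a made-up score mixing backlink count and staleness:

```python
# Toy URL frontier: a heap ordered by an invented priority score that favors
# pages with more known backlinks and pages we haven't fetched recently.
import heapq
import time

class Frontier:
    def __init__(self):
        self._heap = []

    def add(self, url, backlinks=0, last_crawled=0.0):
        staleness = time.time() - last_crawled
        score = backlinks + staleness / 86400.0    # crude: 1 point per stale day
        heapq.heappush(self._heap, (-score, url))  # negate: heapq is a min-heap

    def next_url(self):
        return heapq.heappop(self._heap)[1] if self._heap else None

frontier = Frontier()
frontier.add("http://example.com/", backlinks=42)
frontier.add("http://example.org/old-page", backlinks=3)
print(frontier.next_url())
```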

It is far from my area of expertise, but the Wikipedia page about this looks
very useful. It cites a bunch of wicked smart people.
<http://en.wikipedia.org/wiki/Web_crawler>

If you just want to suck down a bunch of pages, then there's nothing wrong
with wget.

------
aonic
Check out Nutch; not sure if it's exactly what you want though. It's in Java,
not Python, but it works with Hadoop quite nicely.

~~~
sarosh
I concur with the Nutch vote; but more specifically, take a look at the
crawler code written in the src trunk for use with Hadoop. That is probably a
good place to start. Also worth a look is Heritrix (crawler for archive.org).
<http://sourceforge.net/projects/archive-crawler> Sadly, this too is written
in Java.

The only Python one I am aware of for which code is available is:
<http://sourceforge.net/projects/ruya/>

Edit: You might also want to take a look at
<http://wiki.apache.org/hadoop/AmazonEC2>

Edit2: Polybot is another Python based crawler, but no code. However, the
paper has some interesting ideas:

Design and Implementation of a High-Performance Distributed Web Crawler. V.
Shkapenyuk and T. Suel. IEEE International Conference on Data Engineering,
February 2002. <http://cis.poly.edu/westlab/polybot/>

~~~
inovica
Good response. We've created a basic crawler in Python, but are looking for
something more powerful too. Heritrix above looks good

------
bdr
BeautifulSoup is not foolproof, meaning it does not always even approximate
the way a browser would parse the HTML. One important failure is that it fails
to recognize when HTML tags are inside JavaScript strings (and so should not
be treated as markup). Whether this matters depends on your application.

------
iowahansen
1.) Don't do it yourself. Use Amazon's Alexa Web Search service
(aws.amazon.com). Through that you can access Alexa's 10-billion-page index,
complete with all the pages, run complex queries, etc. It plays nicely with EC2.

2.) If you must do it yourself, Heritrix is the most sophisticated crawler out
there (crawler.archive.org).

3.) Nutch is an option, but nowhere near as powerful as Heritrix.

Don't try to reinvent the wheel: writing a robust crawler is a lot of work, as
there are endless edge cases to take care of (if you are looking into a
general-purpose web crawler).

~~~
sonink
Nutch is good and I would second it, but I would suggest NOT building a
crawler - it's not trivial and is ill-advised in a startup, unless your
startup is just about building a crawler.

------
wehriam
I have recently written a Beautiful Soup / Twisted crawler. To make it
distributed, presumably we'd use Amazon's queue service.
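
A minimal sketch of that combination (not the actual code; the URL is a
placeholder, and it assumes Twisted and Beautiful Soup are installed):

```python
# Fetch a page asynchronously with Twisted and hand the body to Beautiful Soup.
from bs4 import BeautifulSoup
from twisted.internet import reactor
from twisted.web.client import Agent, readBody

def parse(body):
    soup = BeautifulSoup(body, "html.parser")
    print(soup.title)          # do something with the parsed page
    reactor.stop()

def fetch(url):
    agent = Agent(reactor)
    d = agent.request(b"GET", url.encode("ascii"))
    d.addCallback(readBody)    # read the full response body
    d.addCallback(parse)
    return d

fetch("http://example.com/")   # placeholder URL
reactor.run()
```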

Feel free to get in touch if you're interested in the details.

~~~
groovyone
Thanks John. I'll do that. Your resume looks awesome, mind. Not sure how much
we'd be able to help you!

------
gojomo
Thanks for the previous positive comments about Heritrix, which is my project
at the Internet Archive. If anyone has questions, please send them my way.

Heritrix was designed for archival projects, which has meant an emphasis on
having a "true record" (including non-text resources) and high configurability
for inclusion/exclusion. Any text indexing or link-graph-analysis is
completely external; we've used Nutch (without their crawler) for that.

Whole-web multi-billion-page crawls have not been the focus yet, though we've
tried one and have heard of outside groups successfully using Heritrix for 2+
and 4+ billion page crawls.

Our distribution story is spotty; we provide some options that help you split
the URL-space you want to crawl across crawlers, and remote-control crawlers
from other programs, but syncing their launch and other steps is left to an
expert operator's own devices. We've run coordinated Heritrix crawls on groups
of 4-8 machines (dual-opteron, 4G+ RAM, 4x500GB+ HDs) and understand others
have used up to 12.

------
xirium
Given the way that Google is heading (
<http://news.ycombinator.com/item?id=149894> ), you should have started at
least six months ago. Regardless:

1\. 10 years ago, at least 99% of web pages failed validation. Nowadays, the
majority still fail. You could validate and then fall through to tag soup
processing (a rough sketch of that fallback follows after this list).

2\. 10 years ago, the conventional wisdom (
[http://www.tbray.org/ongoing/When/200x/2003/07/30/OnSearchTO...](http://www.tbray.org/ongoing/When/200x/2003/07/30/OnSearchTOC)
) was to use a compiled language, such as C, for spidering (
<http://www.tbray.org/ongoing/When/200x/2003/12/03/Robots> ). Given that
memory increases faster than processing power which increases faster than
bandwidth, this may not be the case nowadays.

3\. That's the meta problem. Solve that and you may find that a search engine
is easier.
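
Regarding point 1, a rough sketch of the validate-then-tag-soup fallback,
using well-formed-XML parsing as a crude stand-in for real validation
(assumes lxml and Beautiful Soup are installed):

```python
# Try a strict parse first; only fall back to a lenient parser if it fails.
from bs4 import BeautifulSoup
from lxml import etree

def parse_page(html_bytes):
    try:
        # Strict parse: only succeeds for well-formed (XHTML-style) markup.
        return etree.fromstring(html_bytes)
    except etree.XMLSyntaxError:
        # Tag soup fallback: Beautiful Soup makes sense of broken HTML.
        return BeautifulSoup(html_bytes, "html.parser")

tree = parse_page(b"<p>unclosed paragraph <b>messy markup")
print(type(tree).__name__)
```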

------
Readmore
I recently built a crawler of my own in Ruby after trying, and ultimately
deciding against, Nutch. Depending on what you want to do with your crawl
there is a very good chance that you'll be able to write a small crawler that
is much easier to extend on your own, and you'll probably be able to write it
in the time it would take you to install, set up, and configure Nutch.

As I said, I used Ruby and specifically Hpricot for the page parsing. I'm
starting to run into problems with Hpricot right now though, and I may actually
try a Python version with Beautiful Soup very soon. Let me know how it goes
for you and maybe we can share some code.

~~~
groovyone
Hi there. I think I've decided we need to build our own, mainly as what we're
wanting to do is quite specific - monitoring of sites for keywords - and really
we should understand, to a degree, this technology. Happy to share if you end
up heading down the Python route. If there is anyone else on here who is doing
crawling/data mining then maybe we could share ideas and help each other
somehow :) My email is in my profile.
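
A minimal sketch of that keyword-monitoring idea (the URL and keyword list are
placeholders; assumes Beautiful Soup is installed):

```python
# Fetch a page and report which watched keywords appear in its visible text.
import urllib.request
from bs4 import BeautifulSoup

KEYWORDS = ["acquisition", "beta launch", "hiring"]  # placeholder keywords

def keywords_on_page(url):
    html = urllib.request.urlopen(url, timeout=30).read()
    text = BeautifulSoup(html, "html.parser").get_text(" ").lower()
    return [kw for kw in KEYWORDS if kw.lower() in text]

print(keywords_on_page("http://example.com/"))  # placeholder URL
```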

~~~
Readmore
Have you seen this article yet? Looks like it would be useful to you.
<http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/>

------
krishna2
+10 for Python +1 for Nutch +1 for Hadoop +1 for Amazon EC2

I think you have bundled two things (crawler and parser, or maybe scraper)
into one term: crawler.

Beautiful Soup is OK. Give html5lib a try (on Google Code) - but at some point
you are going to have to hack the parser, depending on what kind of parsing
you want to do.
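
For illustration, html5lib can be used directly or as Beautiful Soup's tree
builder; it parses markup the way browsers do, which helps with messier pages
(assumes html5lib and beautifulsoup4 are installed):

```python
# Two ways to use html5lib on messy markup.
import html5lib
from bs4 import BeautifulSoup

messy = "<p>unclosed <b>tags <td>stray table cell"

# 1. html5lib directly: builds an xml.etree.ElementTree document by default.
doc = html5lib.parse(messy)

# 2. html5lib as Beautiful Soup's parser: browser-like error recovery with
#    Beautiful Soup's navigation API.
soup = BeautifulSoup(messy, "html5lib")
print(soup.p.get_text())
```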

~~~
groovyone
Thanks for that suggestion. Will have a good look at it.

------
inovica
Look at HarvestMan. Quite useful.

~~~
groovyone
I looked, but this is not distributed, so currently (whilst good) it's limited
to one server.

