
How to write a crawler - EmanueleMinotto
http://www.emanueleminotto.it/how-to-write-a-crawler
======
AcoonDe
There are a few things that I hope you just left out in your description...

\- There is no mention of implementing a crawl-delay. You should _always_ wait
for several seconds (better yet, a minute) between requests to the same host.

\- Do you follow redirects when requesting the robots.txt? You should! Some
sites send you a redirect to a different URL even for robots.txt. In most
cases it is just a slightly different hostname, like www.domain.com instead of
domain.com. But it can redirect you to somewhere completely different in some
cases.

\- You probably don't want to crawl anything that ends with .jpg or .gif,
and definitely not something like .avi, .wmv or .mkv. There are a LOT more
file extensions that you'll want to ignore. (See the sketch after this
list.)
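
A rough sketch of all three points in Python (the user-agent string, delay
value, and extension list are my own choices; urllib's default opener
follows redirects, which covers the redirected-robots.txt case):

    import time
    import urllib.robotparser
    from urllib.parse import urlsplit

    # Far from complete -- extend this list as you hit new file types.
    SKIP_EXTENSIONS = (".jpg", ".gif", ".png", ".avi", ".wmv", ".mkv")

    CRAWL_DELAY = 5.0  # seconds between requests to the same host
    last_hit = {}      # host -> timestamp of our last request

    def polite_wait(url):
        """Sleep until at least CRAWL_DELAY seconds have passed for this host."""
        host = urlsplit(url).netloc
        elapsed = time.time() - last_hit.get(host, 0)
        if elapsed < CRAWL_DELAY:
            time.sleep(CRAWL_DELAY - elapsed)
        last_hit[host] = time.time()

    def allowed_by_robots(url):
        """Fetch robots.txt (redirects are followed) and check the URL.
        A real crawler would cache one parser per host."""
        parts = urlsplit(url)
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url("%s://%s/robots.txt" % (parts.scheme, parts.netloc))
        rp.read()
        return rp.can_fetch("MyCrawler", url)

    def should_fetch(url):
        path = urlsplit(url).path.lower()
        return not path.endswith(SKIP_EXTENSIONS) and allowed_by_robots(url)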

I agree with cmiles74 that using a database is probably a bad idea. For a
sizeable crawl (say a billion pages) this database will get pretty damn big. I
doubt that you will be able to get decent performance out of anything with
"SQL" in its name for such a use-case, unless you throw a ton of hardware at
it. Building your own specialized solution for this would probably be a lot
faster and less resource-intensive.

~~~
at-fates-hands
> You should always wait for several seconds (better yet, a minute) between
> requests to the same host.

This, a thousand times.

I learned this lesson AFTER getting several angry emails from admins and
getting outright banned from one site because the first crawler I built had
no delay between requests.

------
asperous
I want to mention Nutch here:
[http://nutch.apache.org/](http://nutch.apache.org/), since it has been
around for a while and a lot of thought was put into its design. For
instance, while people here are discussing data stores, Nutch already uses
Hadoop.

The web is probably bigger than you think. Google reported that its
"systems that process links on the web to find new content hit a milestone:
1 trillion (as in 1,000,000,000,000) unique URLs on the web at once!" (July
2008)

You might consider just crawling certain parts of the web, or using a
search engine API (like Yahoo! BOSS) to gather relevant links and crawl
from there, using a depth limit. Just an idea.

~~~
KMag
I used to be on the Google indexing team. Disregarding limits on the length of
URLs, the size of the visible web is already infinite. For instance, there are
many calendar pages out there that will happily give you month after month ad
infinitum if you keep following the "next" link.

Now, depending on how you prune your crawl to get rid of "uninteresting"
content (such as infinite calendars) and how you deduplicate the pages you
find, you'll come up with vastly varying estimates of how big the visible web
is.

Edit: on a side note, don't crawl the web using a naive depth-first search.
You'll get stuck in some uninteresting infinitely deep branch of the web.
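
A minimal illustration in Python (a breadth-first frontier with a depth
cap; `fetch` and `extract_links` are placeholders for your own functions):

    from collections import deque

    def crawl_bfs(seed_urls, fetch, extract_links, max_depth=5):
        """Breadth-first crawl with a depth cap: an infinite chain of 'next'
        links costs at most max_depth fetches instead of trapping the crawler."""
        seen = set(seed_urls)
        frontier = deque((url, 0) for url in seed_urls)
        while frontier:
            url, depth = frontier.popleft()  # FIFO = breadth-first
            page = fetch(url)
            if depth == max_depth:
                continue  # deep enough; don't expand further
            for link in extract_links(page):
                if link not in seen:
                    seen.add(link)
                    frontier.append((link, depth + 1))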

~~~
EmanueleMinotto
You're right, I forgot to state it explicitly in the article, but if
someone follows the instructions (extract all <a> tags and add them to the
index), that method is implied.

------
yogo
You definitely want to store raw text in flat files and store _metadata_ in
a database. By metadata I'm not referring only to the values found in the
HTML page's meta tags but to other things like page hash, word count, link
count, etc. It all depends on what you are doing. If it all goes in a
database, you will end up with a very, very big file for that archives
table (MongoDB and MySQL+InnoDB come to mind).
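
A minimal sketch of that split in Python (the SQLite database, archive
directory, and column set here are just for illustration):

    import hashlib
    import os
    import sqlite3

    os.makedirs("archive", exist_ok=True)
    db = sqlite3.connect("crawl_meta.db")
    db.execute("""CREATE TABLE IF NOT EXISTS pages
                  (url TEXT PRIMARY KEY, page_hash TEXT,
                   word_count INTEGER, raw_path TEXT)""")

    def store_page(url, raw_text):
        """Raw text goes to a flat file; only metadata goes in the database."""
        page_hash = hashlib.md5(raw_text.encode("utf-8")).hexdigest()
        raw_path = os.path.join("archive", page_hash + ".txt")
        with open(raw_path, "w", encoding="utf-8") as f:
            f.write(raw_text)
        db.execute("INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?)",
                   (url, page_hash, len(raw_text.split()), raw_path))
        db.commit()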

~~~
riffraff
In a couple of projects I worked on, we also stored visited URLs in a set
of Bloom filters, themselves kept in flat files on disk.

At some point, querying the DB to check which URLs you already have can
become quite heavy.
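
Something along these lines, in Python (a hand-rolled sketch; in practice
you'd size the bit array from the expected URL count and use a tested
library):

    import hashlib

    class BloomFilter:
        """No false negatives; false-positive rate depends on sizing."""
        def __init__(self, size_bits=8 * 1024 * 1024, num_hashes=4):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, item):
            for i in range(self.num_hashes):
                digest = hashlib.md5(("%d:%s" % (i, item)).encode()).hexdigest()
                yield int(digest, 16) % self.size

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

The `bits` bytearray serializes straight to a flat file with a single
write.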

------
cmiles74
I respectfully disagree. If ever there was a use case for a NoSQL storage
solution, web crawling certainly seems to be it. I've used Elasticsearch
for indexing and Cassandra for storage, and performance was more than good
enough for our use cases. It was easy to scale, as well.

~~~
EmanueleMinotto
A NoSQL solution is good because it's a DBMS (it allows you to order
collections). :) The filesystem is not good because you would need to order
the link files by visit date, descending (not supported by most
filesystems), and to check whether a URL is in the index you would have to
store it under its MD5 hash as the file name. A small DBMS like SQLite is
not good for obvious reasons.

~~~
nostrademons
I would highly recommend storing the working set of links in RAM (with
checkpointing to write it out to disk periodically). A Redis Set (for visited
links) + Sorted Set (for unvisited links, ordered by priority) is perfect for
this, since it lets you take up one full machine's RAM and does checkpointing
automatically. If your crawl is too big to fit in RAM, get more machines and
shard by URL hash. As others have pointed out, the file content itself should
go in files, ideally ones that you can write to with straight appends.
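
A sketch of that layout with the redis-py client (the key names and the
priority score are assumptions; ZPOPMAX needs Redis 5+):

    import redis

    r = redis.Redis()

    def enqueue(url, priority):
        """Add an unvisited URL to the frontier unless already crawled."""
        if not r.sismember("visited", url):
            r.zadd("frontier", {url: priority})

    def next_url():
        """Pop the highest-priority URL and mark it visited."""
        popped = r.zpopmax("frontier")  # [(member, score)] or []
        if not popped:
            return None
        url = popped[0][0].decode("utf-8")
        r.sadd("visited", url)
        return url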

The reason you don't want to hit the disk with each link (as both MySQL and
Postgres usually do, barring caching) is that there can be hundreds to
thousands of links on a page. A disk hit takes ~10ms; if you need to run
hundreds of those, it's well over a second _per page_ just to figure out which
links on it are unvisited. Accessing main memory is about 100,000 times
faster; even with sharding and RPC overhead for a distributed memory cache,
you end up way ahead.

The reason to write the crawl text to an append-only log file is that disk
seek times are bound by the mechanics of the drive (arm movement plus
rotational latency), which haven't improved much recently, while disk
bandwidth grows with data density times rotation speed, and density has
gone way up. So appends are much more efficient on disk than seeks are.
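
Concretely, the write path can be as simple as this (a sketch; the offset
returned is what you'd keep in your metadata index to find the page again):

    class CrawlLog:
        """Append-only page store: every write is a straight append."""
        def __init__(self, path="crawl.log"):
            self.f = open(path, "ab")  # position starts at end of file

        def append(self, url, body_bytes):
            offset = self.f.tell()
            # Length-prefixed framing so bodies may contain anything.
            self.f.write(("%s %d\n" % (url, len(body_bytes))).encode("utf-8"))
            self.f.write(body_bytes)
            self.f.flush()
            return offset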

------
rzendacott
Udacity CS101 [1] also goes through the basics of building a web crawler. It's
a lot more lightweight (no backend, etc), but it's a fun overview and can be
completed pretty quickly.

[1]:
[https://www.udacity.com/course/cs101](https://www.udacity.com/course/cs101)

~~~
mattcanhack
I took this when it first came out, definitely a good course for beginners or
intermediates.

------
staunch
The only complicated part that I've run into when writing crawlers is
accidental tarpits. It's very easy to run into a situation in which you're
repeatedly requesting the same content via many different URLs.

For example, when a tracking parameter is added to any URL within the site:

[http://example.com/?cid=104484&pid=12002348&ref=1294902](http://example.com/?cid=104484&pid=12002348&ref=1294902)

[http://example.com/?cid=104484&pid=12002348&ref=1294904](http://example.com/?cid=104484&pid=12002348&ref=1294904)

[http://example.com/?cid=104484&pid=12002348&ref=1294905](http://example.com/?cid=104484&pid=12002348&ref=1294905)

[http://example.com/?cid=104484&pid=12002348&ref=1294906](http://example.com/?cid=104484&pid=12002348&ref=1294906)

You can quickly get to billions of permutations for a single site. The
canonical tag solves the problem when it's there, but I still haven't seen a
simple solution to the problem when it's not.
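
The closest thing I know of to a workaround is normalizing URLs against a
blacklist of known tracking parameters before deduplicating, though it only
catches parameters you've listed (a Python sketch; the parameter set is
illustrative):

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    TRACKING_PARAMS = {"ref", "utm_source", "utm_medium", "utm_campaign",
                       "utm_term", "utm_content", "fbclid", "gclid"}

    def normalize(url):
        """Drop known tracking parameters and sort the rest, so variants
        differing only in tracking noise collapse to one key."""
        parts = urlsplit(url)
        query = sorted((k, v) for k, v in parse_qsl(parts.query)
                       if k not in TRACKING_PARAMS)
        return urlunsplit((parts.scheme, parts.netloc, parts.path,
                           urlencode(query), ""))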

~~~
EmanueleMinotto
If it's not excluded by the robots.txt rules and there's no canonical link,
that's not a problem the bot can solve: it can't know whether those pages
are different or not, so the pages you linked count as different pages.
This is the same reason why crawlers can't try to fill in forms.

If you are really sure that those pages are the same, try checking the body
content (if two or more pages have the same MD5 hash of their content, they
are the same page), or look for the form that generates those URLs.
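
Something like this (Python rather than the article's PHP; note that any
dynamic element in the page, such as a timestamp in the footer, will defeat
a hash of the raw body):

    import hashlib

    seen_hashes = set()

    def is_duplicate(body):
        """Two pages with the same MD5 of their content are the same page."""
        digest = hashlib.md5(body.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            return True
        seen_hashes.add(digest)
        return False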

~~~
novaleaf
great tip on the md5, thanks

------
peterwwillis
Crawlers are one of those projects that are honestly best left to someone
else. Fun as a hobby, but a nightmare to get right, and someone has already
done the work for you. The exception is limited-use tools like Wget that
can give you practical results for small-domain retrieval, but they'll kill
you on CPU and memory and are impossible to scale; use a better tool or
customize an existing one if you need to support large-scale crawls.

Some of the "little things" matter much more than your content analyzer or
HTTP parsing - DNS performance and multi-homing are just two that can have
drastic effects.

Just as an example of how complex it gets, here's a brief overview of _some_
of the features all crawlers should take into account:
[http://en.wikipedia.org/wiki/Web_crawler](http://en.wikipedia.org/wiki/Web_crawler)

------
drakaal
The Content Extractor is overly simple. For most purposes you don't want
all the anchor tags; you want the ones that aren't part of the page
template.
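
As a first approximation (a much simpler sketch than a production
extractor): treat links that appear on nearly every page of a site as
template navigation and drop them:

    from collections import Counter

    def template_links(pages_links, threshold=0.8):
        """Given per-page link sets from one site, return links appearing on
        >= threshold of pages -- likely nav/footer boilerplate."""
        counts = Counter(link for links in pages_links for link in set(links))
        cutoff = threshold * len(pages_links)
        return {link for link, n in counts.items() if n >= cutoff}

    def content_links(page_links, boilerplate):
        return [link for link in page_links if link not in boilerplate]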

It took a lot of man hours to build the content extractor we use for our
search engine.

[https://www.mashape.com/stremor/stremor-content-extractor](https://www.mashape.com/stremor/stremor-content-extractor)

Also, because not every site has a sitemap or a well-linked structure, you
may have to turn to social sources like Facebook and Twitter if you want to
get everything.

------
gpsarakis
Are you using MySQL in your example?

> id is an incremental value, I choose 11 as a length for this primary key
> but this value is defined by the number of pages you'll need to index

This is a bit confusing. The (11) is only a display width in MySQL; it
doesn't affect the range of the column at all. You'd generally be better
off with _INT UNSIGNED_, as it doubles the range available to
auto-increment columns.

Also, the _visited_ field would be better represented by a timestamp;
choosing the right datatype _does_ matter in large tables.
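
A minimal sketch of the suggested types (MySQL DDL kept in a Python string;
every name except id and visited is a guess, since only those fields are
mentioned here):

    # Unsigned doubles the auto-increment range; a NULLable TIMESTAMP
    # records *when* a page was crawled instead of a 0/1 flag.
    PAGES_DDL = """
    CREATE TABLE pages (
        id      INT UNSIGNED NOT NULL AUTO_INCREMENT,
        url     VARCHAR(2048) NOT NULL,
        visited TIMESTAMP NULL DEFAULT NULL,
        PRIMARY KEY (id)
    )
    """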

------
gondo
A couple of additional points:

\- Encoding: your input will come in many different encodings and it's
quite hard to guess the correct one, but you should at least try to convert
everything into one encoding (e.g. UTF-8).

\- By setting CURLOPT_ENCODING to '' you don't have to worry about
(un)gzipping, as curl will do this for you (or it should).

\- It might be a good idea to use a URL hash as the URL id (e.g. CRC32).

\- You should check Content-Length and Content-Type to avoid downloading
huge files (see the sketch after this list).

By the way, your coding style is very distracting: there shouldn't be
spaces before or after ->.

------
elchief
crawler4j is a nice open source (Apache 2) Java web crawler.

Multi-threaded. Built-in page delay (200ms default). Does HTTPS, headers,
POST, and cookies, and follows robots.txt.

I prefer it to Nutch for small to medium sized jobs.

[https://code.google.com/p/crawler4j/](https://code.google.com/p/crawler4j/)

------
ghostdiver
PHP + MySQL is the most unfortunate technology stack for this project.

