

Writing a Generalized Web Crawler to map all the URLs on web sites? - JackInFact

I want to crawl web sites to find all the pages on each site (domain). My simplistic attempts were quickly foiled by the weird URL structures of some sites. Many sites add dynamic query parameters to URLs (session ID, referring page, etc.), effectively creating millions of unique URLs.

Is there a good strategy to find every page without getting caught in accidental tarpits like this?

Ideally I'd love to find some existing code that solves the problem of crawling web sites in a generalized way. Second best would be finding some great texts on the subject that I can work from while creating my own code.

Thanks in advance for any help.
======
byoung2
Your script should be able to accept a list of parameters to exclude, and then
you can use regular expressions to strip them out. For example, if the URLs
look like
www.example.com/pages/category?section=tutorials&sessionID=12345&refID=home,
you could exclude (sessionID, refID) while retaining other parameters (e.g.
"section" in this example).

If these pages declare canonical URLs, you could extract the canonical URL and
add that to your index instead of the raw URL.
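
If you go that route, extracting the canonical link only needs the
standard-library HTML parser; a sketch, assuming the page declares
<link rel="canonical">:

    from html.parser import HTMLParser

    class CanonicalFinder(HTMLParser):
        def __init__(self):
            super().__init__()
            self.canonical = None

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "link" and attrs.get("rel", "").lower() == "canonical":
                self.canonical = attrs.get("href")

    def find_canonical(html):
        finder = CanonicalFinder()
        finder.feed(html)
        return finder.canonical  # None if the page declares no canonical URL

    print(find_canonical('<link rel="canonical" href="http://www.example.com/pages/category">'))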

~~~
JackInFact
I considered that, and it seems like the only solution to me. When a canonical
<link> tag is available, it seems pretty easy to solve that way.

I'm really curious how Google (and other search engines) solve this problem.

Do they maintain a massive list of query parameters to ignore? JSESSIONID and
the like are easy enough, but there are many others.

Many pages don't declare a canonical URL with a <link> tag either.

~~~
byoung2
_I'm really curious how Google (and other search engines) solve this problem.

Do they maintain a massive list of query parameters to ignore? JSESSIONID and
the like are easy enough, but there are many others_

I would imagine that when Google crawls a page, they analyze the content and
compare it to other pages with similar URLs. If the content is identical to a
page they have already seen at a different URL, they look for a canonical URL
and index using that. If there isn't one, they pick one URL to represent the
many identical pages. Over time, as they crawl a site, they would learn which
parameters to ignore.
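
As a rough sketch of that idea (the helper names are hypothetical, and hashing
the raw page body is a stand-in for whatever fuzzier similarity check a real
crawler would use):

    import hashlib
    from collections import defaultdict
    from urllib.parse import urlparse, parse_qsl

    seen = {}                       # content hash -> URL chosen to represent it
    ignorable = defaultdict(int)    # param name -> times it varied with identical content

    def record(url, body):
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen[digest] = url      # first time we see this content, keep this URL
            return url
        # Same content under a different URL: any query parameter that differs
        # between the two is a candidate to ignore on future crawls of the site.
        old = dict(parse_qsl(urlparse(seen[digest]).query))
        new = dict(parse_qsl(urlparse(url).query))
        for name in set(old) | set(new):
            if old.get(name) != new.get(name):
                ignorable[name] += 1
        return seen[digest]

After enough pages, parameters with a high count (sessionID and friends) could
be promoted onto the exclude list automatically.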

