
In my experience, one of the hardest parts of writing a web crawler is URL selection:

After crawling your list of seed URLs, where do you go next? How do you make sure you don't crawl the same content multiple times because it has a slightly different URL? How do you avoid getting stuck on unimportant spam sites with autogenerated content?

Because the author only crawled domains from a limited set, and only for a short time, he did not need to care about that part. Nonetheless, it's a great article that shows many of the pitfalls of writing a web crawler.



Yup. It's very hard to avoid getting stuck in accidental tar pits: one page with inexhaustible URL variations.

  http://example.com/article/story?storyID=19039&ref=329932&sessionID=9043275

  http://example.com/article/story?storyID=19039&ref=902932&sessionID=9023409

  http://example.com/article/story?storyID=19039&ref=904354&sessionID=8230235

  ... infinite ...
Canonical URL meta tags can help.

Filtering out certain query parameters (JSESSIONID, PHPSESSID, etc.) can help.

Sitemaps help.
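To make the query-parameter filtering above concrete, here's a minimal sketch of that kind of URL normalization; the parameter blacklist is just an illustrative guess and would need tuning per site:

  from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

  # illustrative blacklist of session/tracking parameters; real crawlers tune this
  STRIP_PARAMS = {"jsessionid", "phpsessid", "sessionid", "sid", "ref"}

  def normalize(url):
      scheme, netloc, path, query, _fragment = urlsplit(url)
      kept = sorted((k, v) for k, v in parse_qsl(query, keep_blank_values=True)
                    if k.lower() not in STRIP_PARAMS)
      # lowercase the host, drop the fragment, sort and filter the query string
      return urlunsplit((scheme.lower(), netloc.lower(), path, urlencode(kept), ""))

  # all three example URLs above collapse to the same canonical form:
  # http://example.com/article/story?storyID=19039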

I'm rather impressed that search engines do it so well. I imagine the right approach involves examining the contents of the pages and doing checksums, but I'd love to know what the real search engines do.

I do know that a naive crawler will completely fail to crawl any significant portion of the real web without solving this. It's quite possible those 250 million "pages" were actually something like 1 million distinct pages.


>I'm rather impressed that search engines do it so well. I imagine the right approach involves examining the contents of the pages and doing checksums

Interesting point. Q: Is it possible for two different pages to give the same checksum? (Asking for my own info; I do know what checksums are, in overview, from way back, but haven't checked :) them out much.)


Given two specific different pages, getting the same checksum should be VERY unlikely. But getting some collision among billions of pages is fairly to very likely, depending on the type of checksum. So it is probably important to only compare within domains, or to divide the space by some other method.
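As a rough back-of-the-envelope check (my own numbers): the expected count of colliding pairs among n pages with a b-bit checksum is about n^2 / 2^(b+1).

  def expected_collisions(n_pages, hash_bits):
      # birthday bound: ~ n*(n-1)/2 pairs, each colliding with probability 2^-bits
      return n_pages * (n_pages - 1) / 2 / 2 ** hash_bits

  for bits in (32, 64, 128):
      print(bits, expected_collisions(1_000_000_000, bits))
  # 32 bits  -> ~1.2e8 collisions (useless at web scale)
  # 64 bits  -> ~0.03 (rare, but not negligible as the crawl grows)
  # 128 bits -> ~1.5e-21 (effectively never)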

Also, checksums don't help you at all when pages that you want to treat as the same aren't actually identical, because they include the request time or some other unique element.

Edit: Added sentence about comparing within domains.


You could use an edit distance algorithm instead of checksums, although that would be really time-intensive.
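For illustration, a plain Levenshtein edit distance looks like this; the O(len(a) * len(b)) cost per pair of pages is exactly why it gets expensive:

  def levenshtein(a, b):
      # classic dynamic-programming edit distance, O(len(a) * len(b))
      prev = list(range(len(b) + 1))
      for i, ca in enumerate(a, 1):
          curr = [i]
          for j, cb in enumerate(b, 1):
              curr.append(min(prev[j] + 1,                 # deletion
                              curr[j - 1] + 1,             # insertion
                              prev[j - 1] + (ca != cb)))   # substitution
          prev = curr
      return prev[-1]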


Also, some sites have a LOT of generated pages for a tiny section. For instance, crawling my GitHub account with my homemade crawler was impossible. There is an amazing level of depth there.


"Web traps" and other problems can be controlled by a breadth-first crawling strategy.

A very simple and reliable web crawler can be built around a component that takes a list of in-urls, fetches the pages, parses them, and emits a list of out-urls.

The out-urls from that phase become the in-urls of the next phase.

If you do this, you'll find the size of each generation gets larger for a while, but eventually it starts getting smaller, and if you take a look you'll find that almost everything getting crawled is junk -- like online calendars that keep going forward into the future forever.

At some point you stop crawling, maybe between generation 8 and 12, and you've dodged the web trap bullet without even trying.
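A minimal sketch of that generation-by-generation loop; fetch_outlinks is a stand-in for the fetch-and-parse step:

  def crawl_by_generation(seed_urls, fetch_outlinks, max_generations=10):
      seen = set(seed_urls)
      frontier = list(seed_urls)
      for gen in range(max_generations):
          next_frontier = []
          for url in frontier:
              for out_url in fetch_outlinks(url):  # fetch the page, parse it, return its out-urls
                  if out_url not in seen:
                      seen.add(out_url)
                      next_frontier.append(out_url)
          print(f"generation {gen}: {len(frontier)} in-urls -> {len(next_frontier)} new out-urls")
          frontier = next_frontier
          if not frontier:
              break
      return seen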


To build a smart spider, you can have a classifier that determines if you should crawl the outlinks of the page.

This classifier can be trained using logistic regression.

This is slightly tricky, but effective. I've been meaning to write in more depth about this topic (smart crawling).
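A rough sketch of the idea, assuming scikit-learn and a set of hand-labelled example pages; the features here are illustrative guesses, not a recommended feature set:

  from sklearn.feature_extraction import DictVectorizer
  from sklearn.linear_model import LogisticRegression

  def features(page):
      return {
          "url_len": len(page["url"]),
          "digits_in_url": sum(c.isdigit() for c in page["url"]),
          "num_outlinks": len(page["outlinks"]),
          "text_to_html_ratio": len(page["text"]) / max(len(page["html"]), 1),
      }

  vectorizer = DictVectorizer()
  classifier = LogisticRegression(max_iter=1000)

  def train(labelled_pages):                 # [(page, label)], label 1 = worth expanding
      X = vectorizer.fit_transform([features(p) for p, _ in labelled_pages])
      y = [label for _, label in labelled_pages]
      classifier.fit(X, y)

  def should_crawl_outlinks(page, threshold=0.5):
      X = vectorizer.transform([features(page)])
      return classifier.predict_proba(X)[0, 1] >= threshold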

[edit: I'm stepping out so I can't write more right now. You can email me if you have any questions about this.]


Fuzzy hashing and a database? Break the content into parts and generate scores for each part. Calculate a hash entry for each part and enter it into a database. Compare against existing entries. If the content is too similar, then reject it. Doesn't get you around spam, though.


Here's how Google avoids duplicate content: http://infolab.stanford.edu/~manku/papers/07www-duplicates.p...
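That's the simhash paper (Manku et al., WWW 2007). A stripped-down version of the fingerprint looks roughly like this, using word-level features with unit weights, which is a simplification of what the paper does:

  import hashlib

  def simhash(text, bits=64):
      # sum +1/-1 votes per bit position across the hashes of all features
      v = [0] * bits
      for word in text.split():
          h = int.from_bytes(hashlib.md5(word.encode()).digest()[:8], "big")
          for i in range(bits):
              v[i] += 1 if (h >> i) & 1 else -1
      return sum(1 << i for i in range(bits) if v[i] > 0)

  def hamming_distance(a, b):
      return bin(a ^ b).count("1")

  # near-duplicate pages end up with fingerprints only a few bits apart;
  # the paper uses 64-bit fingerprints and a distance threshold of 3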


My initial plan, before I started coding, was to use shingling (essentially a type of fingerprinting) to reduce the near-duplicate problem. My early test crawls were straight-up breadth-first crawls, and these also suggested that de-duplication and spam elimination were going to be significant problems. However, after I switched to a domain whitelist and intra-domain crawling my informal tests suggested that the de-duplication and spam elimination problems were significantly reduced. So I decided to leave shingling to a later iteration.
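A minimal sketch of what that shingling step would look like; the shingle size and similarity threshold here are arbitrary:

  import hashlib

  def shingles(text, w=5):
      # hash every run of w consecutive words ("w-shingles")
      words = text.split()
      return {hashlib.sha1(" ".join(words[i:i + w]).encode()).hexdigest()
              for i in range(max(len(words) - w + 1, 1))}

  def jaccard(a, b):
      return len(a & b) / len(a | b) if (a | b) else 1.0

  def near_duplicate(text1, text2, threshold=0.9):
      return jaccard(shingles(text1), shingles(text2)) >= threshold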

Something I didn't deal with at all was spider traps. My thinking there was that this was a relatively shallow crawl, so the dangers of getting badly stuck weren't too great. For deeper crawls this would be essential.


How about keeping a count of links encountered and using the highest count to decide what to crawl next? And setting a maximum number of pages to crawl per site, to prevent never-ending single-site crawls?
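Roughly what that would look like, as a sketch (a priority frontier keyed on link counts plus a per-site cap; the numbers are placeholders). It also makes concrete the bookkeeping cost the reply below points out:

  import heapq
  from collections import Counter, defaultdict
  from urllib.parse import urlsplit

  MAX_PAGES_PER_SITE = 10_000           # arbitrary cap

  link_counts = Counter()               # url -> number of links seen pointing at it
  pages_fetched = defaultdict(int)      # site (netloc) -> pages crawled so far
  crawled = set()
  frontier = []                         # max-heap via negated counts

  def add_link(url):
      link_counts[url] += 1
      # heapq has no decrease-key, so push a fresh entry and skip stale ones later
      heapq.heappush(frontier, (-link_counts[url], url))

  def next_url():
      while frontier:
          neg_count, url = heapq.heappop(frontier)
          if url in crawled or -neg_count != link_counts[url]:
              continue                  # already crawled, or a stale heap entry
          site = urlsplit(url).netloc
          if pages_fetched[site] >= MAX_PAGES_PER_SITE:
              continue
          crawled.add(url)
          pages_fetched[site] += 1
          return url
      return None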


>How about keeping a count of links encountered and using the highest count to decide what to crawl next

Then you have a giant heap with several billion elements, and you're going to be updating hundreds of its entries for each web page you crawl. I'm pretty sure that'll be your new bottleneck.

Besides, it still doesn't solve the problem of content farms and spam.

>setting a maximum number of pages to crawl per site, to prevent never-ending single-site crawls?

You either set the limit too low and don't crawl, for example, all of Wikipedia, or you set the limit too high and still waste tons of resources.


Set the limit as a function of incoming links - Wikipedia will get a high limit, while a spam site with few incoming links from the current crawl will get a lower limit.

It's probably best to add an IP-based limit factor too, to handle link farms.

Something like "max pages crawled is directly proportional to (incoming links / linking unique domains)"; or just make it naively proportional to the number of unique linking domains, to reduce processing cost.
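As a sketch, with placeholder constants (the per-IP grouping is left out for brevity):

  def max_pages(incoming_links, unique_linking_domains, k=25, cap=1_000_000):
      if unique_linking_domains == 0:
          return k                     # unseen site: small default budget
      # naive variant: budget proportional to the number of unique linking domains;
      # the ratio variant would use incoming_links / unique_linking_domains instead
      return min(k * unique_linking_domains, cap)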


> Set the limit as a function of incoming links - Wikipedia will get a high limit, while a spam site with few incoming links from the current crawl will get a lower limit.

There are spam sites with more incoming links than Wikipedia....

Designing crawl policy is a lot like designing security policy. The opposition is very well-funded, smart, experienced, and very motivated.

If you're just smart, they've got you beat.


Can you give examples of spam sites with more inbound linking domains per IP than Wikipedia? Have you crawled this info yourself, or is it from something like Linkscape?

Google apparently solve this with a domain trust factor, discounting link juice from low-value domains for SERP purposes, but I'm not sure what they feed forward to control their bots.

This means that for some time it has appeared that large numbers of low-quality inbound links have been a negative for Google SERP placement ... so I'm wondering if such [spam] domain owners are being as smart as you think?


> Can you give examples

It's been several years since I worked on Yahoo's crawler.

It's interesting that at the beginning of this you doubted that spam domains could have large numbers of inbound links (and suggested that inbound link counts were a powerful indicator), but by the end you seemed aware of such domains....


>by the end, you seemed aware of such domains

I'm not. I've never done any research on "spam domains". Do you have any examples or not?


> I'm not.

Then who wrote

> This means that for some time it has appeared that large numbers of low-quality inbound links have been a negative for Google SERP placement

?


Google use a trust value to rank the domains from which links come to a domain. I don't see what that says about the spamminess of the domain receiving the links. Good-quality websites receive a lot of links, and what I'm saying here is that it's important, when garnering links, to ensure that those links are from healthy domains. Inbound links appear still to be one of the most important ranking metrics, but there are domains from which one wouldn't wish to be fostering inbound links.


You claimed that bad sites don't have lots of incoming links. Then you wrote about Google's countermeasures against bad sites with lots of incoming links....


You're reading in what's not there. Yes, I noted that Google have countermeasures to address sites being more highly ranked than they should be because, whilst they have a lot of inbound links, the inbound links are of low quality (from low trust domains).

But none of that presupposes any knowledge of "spam sites", nor of the particular instances you note of spam sites with lots of inbound links (i.e. the claimed far more than Wikipedia [per IP address]).

But it seems you don't have any examples, so I'm not sure why you're persisting. Even if I did know about specific spam sites with lots of incoming links, I don't see the relevance of my knowing that to your answering the question of whether you can give any examples of sites with your claimed characteristics. The points are orthogonal. Your pre-knowledge of such sites is not in any way bound by my knowledge of such sites.

I daren't ask if you can answer the question again. But just suppose you could; then a response would still be of interest.



