Hacker News new | past | comments | ask | show | jobs | submit login
How to crawl a quarter billion webpages in 40 hours (michaelnielsen.org)
323 points by cing on Aug 10, 2012 | hide | past | favorite | 67 comments

In my experience, one of the hardest parts of writing a web crawler is URL selection:

After crawling your list of seed URLs, where do you go next? How do you make sure you don't crawl the same content multiple times because it has a slightly different URL? How to avoid getting stuck on unimportant spam sites with autogenerated content?

Because the author only crawled domains from a limited set and only for a short time, he did not need to care about that part. Nonetheless, it's a great article that shows many of the pitfalls of writing a webcrawler.

Yup. It's very hard to avoid getting stuck in accidental tar pits. One page with inexhaustible URL variations.




  ... infinite ...
Canonical URL meta tags can help.

Filtering out certain query parameters (JSESSIONID, PHPSESSIONID, etc) can help.

Sitemaps help.

I'm rather impressed that search engines do it so well. I imagine the right approach involves examining the contents of the pages and doing checksums, but I'd love to know what the real search engines do.

I do know that a naive crawler will completely fail to crawl any significant portion of the real web without solving this. It's quite possible those 250 million "pages" were actually something like 1 million distinct pages.

>I'm rather impressed that search engines do it so well. I imagine the right approach involves examining the contents of the pages and doing checksums

Interesting point. Q: Is it possible for two different pages to give the same checksum? (Asking for my own info; I do know what checksums are, in overview, from way back, but haven't checked :) them out much.

Given two different pages getting the same checksum should be VERY unlikely. Getting a collision among billions of pages is fairly or very likely depending on the type of checksum. So it is likely to be important to only compare within domains or divide the space by some method.

Also it doesn't help you at all because pages returned that you want to treat as the same aren't actually the same because they include the request time or other unique element.

Edit: Added sentence about comparing with domains.

You could use an edit distance algorithm instead of using checksums. Although that would be really time intensive.

Also, some sites have a LOT of generated pages for a tiny section. For instance, crawling my GitHub account with my homemade crawler was impossible. There is an amazing level of depth there.

"Web traps" and other problems can be controlled by a breadth-first crawling strategy.

A very simple and reliable web crawler can be built by building something that takes a list of in-urls, fetches the pages, parses them, and emits a list of out-urls.

The out-urls from that phase become the in-urls of the next phase.

If you do this you'll find the size of the generation get larger for a while, but eventually it starts getting smaller and if you take a look at it you'll find that almost everything getting crawled is junk -- like online calendars that keep going forward into the future forever.

At some point you stop crawling, maybe between generation 8 and 12, and you've dodged the web trap bullet without even trying.

To build a smart spider, you can have a classifier that determines if you should crawl the outlinks of the page.

This classifier can be trained using logistic regression.

This is slightly tricky, but effective. I've been meaning to write in more depth about this topic (smart crawling).

[edit: I'm stepping out so I can't write more right now. You can email me if you have any questions about this.]

fuzzy hashing and a database? Break the content into parts, and generate scores for each part. calculate a hash entry for each part and enter into a database. Compare against existing entries. If the content is too similar, then reject. Doesnt get you around spam, though.

here's how google avoids duplicate content: http://infolab.stanford.edu/~manku/papers/07www-duplicates.p...

My initial plan, before I started coding, was to use shingling (essentially a type of fingerprinting) to reduce the near-duplicate problem. My early test crawls were straight-up breadth-first crawls, and these also suggested that de-duplication and spam elimination were going to be significant problems. However, after I switched to a domain whitelist and intra-domain crawling my informal tests suggested that the de-duplication and spam elimination problems were significantly reduced. So I decided to leave shingling to a later iteration.

Something I didn't deal with at all was spider traps. My thinking there was that this was a relatively shallow crawl, so the dangers of getting badly stuck weren't too great. For deeper crawls this would be essential.

How about keeping a count of links encountered and use highest to decide what to crawl next. and set a max pages to crawl per site to prevent never ending single site crawls?

>How about keeping a count of links encountered and use highest to decide what to crawl next

Then you have a giant heap with several billion elements that you're going to be updating 100s of entries on each webpage you crawl. I'm pretty sure that'll be your new bottleneck.

Besides, it still doesn't solve the problem of content farms and spam.

>set a max pages to crawl per site to prevent never ending single site crawls?

You either set the limit too low and don't crawl, for example, all of wikipedia, or you set the limit too high and still waste tons of resources.

Set the limit as a function of incoming links - Wikipedia will get a high limit, spam site with low incoming links from the current crawl will get a lower limit.

It's probably best to add an IP based limit factor too to handle link farms.

Something like "max-pages crawled is directly proportional to the (incoming links / linking unique domains)"; or just have it naively proportional to the linking unique domains to reduce processing cost.

> Set the limit as a function of incoming links - Wikipedia will get a high limit, spam site with low incoming links from the current crawl will get a lower limit.

There are span sites with more incoming links than Wikipedia....

Designing crawl policy is a lot like designing security policy. The opposition is very well-funded, smart, experienced, and very motivated.

If you're just smart, they've got you beat.

Can you give examples of spam sites with more inbound linking domains per IP than wikipedia? Have you crawled this info yourself or is it from something like linkscape?

Google apparently solve this with a domain trust factor discounting link juice from low value domains for SERPs purposes but I'm not sure what they feed forward to control their bots.

This means that for some time it has appeared that large numbers of low quality inbound links has been a negative for Google SERP placement ... so I'm wondering if such [spam] domain owners are being as smart as you think??

> Can you give examples

It's been several years since I worked on Yahoo's crawler.

It's interesting that at the beginning of this, you doubted that spam domains could have large number of inbound links (and suggested that inbound link counts was a powerful indicator) but by the end, you seemed aware of such domains....

>by the end, you seemed aware of such domains

I'm not. I've never done any research on "spam domains". Do you have any examples or not?

> I'm not.

Then who wrote

> This means that for some time it has appeared that large numbers of low quality inbound links has been a negative for Google SERP placement


Google use a trust value to rank the domains from which links come to a domain. I don't see what that says about the spaminess of the domain receiving the links. Good quality websites receive a lot of links and what I'm saying here is that it's important when garnering links to ensure that those links are from healthy domains. Inbound links appears still to be one of the most important ranking metrics but there are domains from which one wouldn't wish to be fostering inbound links.

You claimed that bad sites don't have lots of incoming links. Then you wrote about google's counter-measures against bad sites with lots of incoming links....

You're reading in what's not there. Yes, I noted that Google have countermeasures to address sites being more highly ranked than they should be because, whilst they have a lot of inbound links, the inbound links are of low quality (from low trust domains).

But none of that presupposes any knowledge of "spam sites", nor of the particular instances you note of spam sites with lots of inbound links (ie the claimed far more than wikipedia [per IP address]).

But it seems you don't have any examples so I'm not sure why you're persisting. Even if I did know about specific spam sites with lots of incoming links I don't see the relevance of my knowing that to you answering the question of if you can give any examples of sites with your claimed characteristics. The points are orthogonal. Your pre-knowledge of such sites is not in any way bound by my knowledge of such sites.

I daren't ask if you can answer the question again. But just suppose you could then a response would still be of interest.

> I used a Bloom filter to keep track of which urls had already been seen and added to the url frontier. This enabled a very fast check of whether or not a new candidate url should be added to the url frontier, with only a low probability of erroneously adding a url that had already been added.

other way round? bloom filter provides a low probability of erroneously believing a URL had already been added when it had not, zero probability of believing a URL had not already been added to the filter when in fact it had.

using a bloom filter in this way guarantees you won't ever hit a page twice, but you'll have a non-zero rate of pages you think you've downloaded but you actually haven't, depending how you tune it.

> but you'll have a non-zero rate of pages you think you've downloaded but you actually haven't, depending how you tune it.

My (not-very fresh) memory of what bloom filter actually is tells me that this "non-zero rate" you're talking about must be HUUUUGE. In order of millions of pages. Am I right?

You're right if OP was using a small bloom filter, and wrong if it was a big one. Hence, the phrase, "depending how you tune it."

You can construct a bloom filter with an arbitrarily low error rate.

It's not technically wrong: 0 is indeed a "low probability". It's misleading, though.

> Code: Originally I intended to make the crawler code available under an open source license at GitHub. However, as I better understood the cost that crawlers impose on websites, I began to have reservations. My crawler is designed to be polite and impose relatively little burden on any single website, but could (like many crawlers) easily be modified by thoughtless or malicious people to impose a heavy burden on sites. Because of this I’ve decided to postpone (possibly indefinitely) releasing the code.

That is not a good reason. There are many crawlers out there. Anyone can easily modify the string "robots.txt" in the wget binary to "xobots.txt".

Release your code so that others can learn. Stop worrying that you are giving some special tool to bad people - you aren't.

Not to mention if you were malicious, the crawler would probably be inferior to:

    while :; curl "http://www.ycombinator.com"; done
That said, there are many crawlers out there, many of them probably more sophisticated in their ability to ruin someone else's day. Unless you're releasing an exploit, malicious users probably know how to abuse the internet more than you do.

Or even the infamous LOIC

This is a great article. On a side note if you want to do this all day and get paid for it let me know :-) Crawls are the first step of a search engine. Greg Lindahl, CTO at blekko.com) has been writing up a variety of technologies used in our search engine work at High Scalability [1].

One of the most interesting things for me is that a lot of the 'frothiest' web pages (those that change every day or several times a day) have become pretty significant chunk of the web from even 5 years ago. I don't see that trend abating much.

[1] http://highscalability.com/

I'm not sure if it's yet a major issue, but I've been noticing an increasing number of "frothy" sites that are fake-frothy. They change often, but in trivial ways, I assume to try to make themselves seem fresher for SEO reasons. If it were possible to consistently detect them, it might be better to treat them as non-volatile and avoid wasting time re-crawling them.

Thank you very much for the post. I have written a distributed crawler at my startup Semantics3* - we track the price and metadata fields from all the major ecommerce sites.

Our crawler is written in perl. It uses an evented architecture (written using the AnyEvent library). We use Redis to store state (which urls have been crawled - using the hash- and determine which urls to crawl next - using sorted sets)

Instead of using a bloom filter we used sorted sets to dedupe urls and pick the highest priority urls to crawl next (some sort of priority queue).

For the actual distribution of crawling (the 'map reduce' part) we use the excellent Gearman work distribution server.

One major optimization i can suggest is caching the dns (and also do it asynchronously). You can save a lot of time and resources, especially at that scale, by simply caching dns requests. Another optimization would be to keep the socket connection open and do the download of all the pages from the same domain asynchronously.

*Shameless plug: We just launched our private beta. Please sign up and use our API using this link:


Being the CEO of a firm that offers web-crawling services, I found this post very interesting. On 80legs, the cost for a similar crawl would be $500, so it's nice to know we're competitive on cost.

Came here to suggest 80legs too, happy to see you're on it. I would add that if the project was being done for actual work and not just curiosity, then 80legs has the added advantage of not spending all that dev time building and testing it.

Wouldn't the cost of transferring the crawled data (compressed/archived) to AWS/EC2 partially offset the cost saving?

Edit: AWS still has free imbound BW.. my bad

No - inbound data transfer cost to AWS is free.

nice thing about running crawlers from EC2 - all inbound bandwidth is free

Like running crawlers what can be some other use cases where free inbound is a blessing?

Backup/archive, where uploading data is frequent and downloading it is only very occasionally required.

I notice the welcome message and action buttons on the home page are all jpeg images rather than text, which makes it harder for search engines to make a judgement about the content. (The text appears fuzzy as well.) As a developer of a crawler, is there a reason you found this to be beneficial?

(Completely aside: Your front page's colors look strange on my Mac. https://skitch.com/joshua/ecykp/80legs-custom-web-crawlers-p... -- see the blue border?)

Not just your Mac - my Firefox & Chrome on Linux. Seems the color palettes in http://80legs.com/_images/bg-home.jpg vs http://80legs.com/_images/the-most-powerful-web-crawler-ever... don't match up. Looks great on all browsers on Windows, so I'm guessing the jpgs have a color profile attached.

My Master's degree project was a webcrawler. If you're already reading this, the thesis[0] might be a somewhat entertaining read.

I had a bit different constraints (only hitting frontpage, cms/webserver/... fingerprinting, backend has to be able to do ad-hoc queries for site features), but it's nice to see that the process is always somewhat the same.

One of the most interesting things I experienced was, that link crawling works pretty ok for a certain amount of time, but after you have visited a large amount, bloom filters are pretty much the only way to protect against duplication in a memory efficient way.

I switched to a hybrid model where I do still check for links, but to limit the needed depth, I switched to using pinboard/twitter/reddit to find new domains. For bootstrapping you can get your hands on zonefiles from places on the internet (e.g. premiumdrops.com) that will keep you from having to crawl too deep pretty fast.

These days, I run on a combination of a worker approach with redis as a queue/cache and straight elasticsearch in the backend. I'm pretty happy with the easy scalability.

Web Crawlers are a great weekend project, they allow you to fiddle with evented architectures (github sample [1]), scaling a database and seeing the bottlenecks jump from place to place within your architecture. I can only recommend writing one :)

[0] http://blog.marc-seeger.de/2010/12/09/my-thesis-building-blo... [1] https://github.com/rb2k/em-crawler-sample

Not bad at all. I build just a few months ago (not publicly released even though I plan) a crawler using NodeJS to take advantage of its evented architecture. I managed to crawl and store (in mongo) more than 300k movies from IMDB in just a few hours (using only a laptop and 8 processes), creating many processes and every one having a specified number of concurrent connections (was based on nodejs cluster and kue lib by learnboost). For html parsing, I used jsdom or cheerio (faster but incomplete), but the process of extracting and storing the data was very faster (prob less than 10 ms for a page). Kue is similar to ruby's resque or python's pyres so the advantage was that every request was basically an independent job using redis as a pubsub.

Even though your implementation is a lot complex and very well documented, IMO using non blocking I/O it's a much better solution, because crawling is very intensive I/O and most of the time is spent with the connection (request + response time). Using that many machines and processes, the time should be much shorter with node.

> I managed to crawl [..] more than 300k movies from IMDB in just a few hours

I suppose IMDB already has a pretty good architecture to handle that load, but please, if you're crawling from a single site, be careful. I host a similar database myself, and the CPU/load graphs of my server can tell me exactly when someone has a crawler active again. That's not fun if your goal is to keep a site responsive while keeping the hosting at low cost.

Very true indeed. I was also randomly changing user-agents (Mozilla, Safari, Chrome, IE). I thought that this will be harder to tell whether there is a lot of traffic from the same network or someone is just intensively crawling the site.

For me, it was more a proof of how efficient and fast a crawler can be. Also, a response from IMDB was very fast in less than 0.4 seconds, so not that much time was lost there.

Gray hat question out of curiosity and possible experience: did you also use proxies or perhaps even Tor?

so how polite does one need to be? One hit per x seconds?

If the /robots.txt does not mention a Crawl-delay, one page per 3 seconds is often a safe value. Of course this rather heavily depends on the site. In any case, if you have any specific need, always contact the people responsible for the site. I occasionaly run custom queries against the database on request, for example.

I managed to crawl and store (in mongo) more than 300k movies from IMDB in just a few hours

Did you know that IMDB makes a subset of their data publicly available? http://www.imdb.com/interfaces/

Yes, but it's hard to tell how complete and updated that subset is. There is (was until few weeks ago apparently) also http://www.imdbapi.com/ which was retrieving their data by crawling. Unfortunately, it was shut down.

Please, do release it! I'm (or was, decided to go with Apache Nutch for the time being) in the process of creating a similar crawler (with almost the exact same "technologies" you mentioned). It would save me a lot of time and we might be able to help with fixing bug and adding features...

Ok, I'll work then at creating a documentation and adding some tests. The project was written in coffeescript and someone only needs to extend a class and implement 2 methods and a list of starting urls. Using node cluster and concurrent connections I think it can scale very well. I introduced promises (taken from Jquery Deffered) in case someone wanted the writing to DB to be synchronous.

IMO, using kue was a success because it also offers a web interface where you can check the progress and restart/check failed jobs.

Great - I'll be looking forward to it. What's your GitHub username (if you intend to publish it there) so I can follow you to be notified when it's released?

Wow that was informative. I appreciated the author's responsibility the most. Rather than make this a daring adventure or fanciful notion, Nielsen approached the activity with a genuine interest in creating something awesome, not just from the angle of power, though. Great post.

This is not how to crawl webpages. He started with the Alexa list. Those are not necessarily domain names of servers serving webpages. I would guess that some of the request to cease crawling came from some of these listings. Working from the Alexa list he would have been crawling some of the darkest underbelly of the web: ad servers and bulk email services.

His question: "Who gets to crawl the web?" is an interesting one though.

Do not assume that Googlebot is a smart crawler. Or smarter than all others. The author of Linkers and Loaders posted recently on CircleID about how dumb Googlebot can be.

There is no such thing as a smart crawler. All crawlers are stupid. Googlebot resorts to brute force more often than not.

Theoretically no one should have to crawl the web. The information should be organised when it is entered into the index.

Do you have to "crawl" the Yellow Pages? Are listings arranged by an "algorithm"? PageRank? 80/20 rules?

Nothing wrong with those metrics; except of course that they can be gamed trivially, as experiments with Google Scholar have shown. But building a business around this type of ranking? C'mon.

If the telephone directories abandoned alpha and subject organisation for "popularity" as a means of organisation it would be total chaos. Which is why "organising the world's information" is an amusing mission statement when your entire business is built around enabling continued chaos and promoting competition for ranking.

Even worse are companies like Yelp. It's blackmail.

If the information was organised, e.g., alphabetically and regionally, it would be a lot easier to find stuff. Instead, search engines need to spy on users to figure out what they should be letting users choose for themselves. Where "user interfaces" are concerned, it is a fine line between "intuitive" and "manipulative".

The people who run search engines and directory sites are not objective. They can be bought. They want to be bought.

This brings quality down. As it always has for traditional media as well. But it's much worse with search engines.

>Theoretically no one should have to crawl the web. The information should be organised when it is entered into the index.

What do you mean by this statement? I can see from your post you have a contrarian view on how to organize and find stuff on the web, but I'm struggling to understand what alternative you're proposing.

Well, I'm not him but: probably starting from a zone file and narrowing it down to only whitelisted and "legit" domains would be a good start.

Maybe during the registration process more metadata should be demanded of people and anonymity prohibited or reduced. That way for example if you wanted a list of all the .com blogs it is just a grep away and tied into mostly real people for example. Corporate websites tied to their business entity with an EIN or something and verified. 'etc.

The thing is.. that ship has sailed a long time ago so we are stuck with google.

People game the Yellow Pages too. Why do you think all car services are named something like AAA My Stupid Car Service?

Is there any reason you crawled for 40 hours? Was this the optimal cost from Amazon? Why not do this over more time using bid for spot instances?

He mentions spot instances near the end.

tl;dr - He didn't think about it until just before launching the experiment and worried that it would take too much time to understand the implications of changing his approach. Though he estimates it may represent a factor of five decrease in price.

Excellent article. Looks like there was a lot of work involved in the project, about how long did it take you to make this?

Thanks. I came across fabric from this article. very useful for us.

Very nice article on web crawling.

You can make fabric execute commands in parallel. The reliability will be as good as it'll get with chef. I've spent ages dealing with edge cases with both fabric and chef setup systems.


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact