
How to crawl a quarter billion webpages in 40 hours - cing
http://www.michaelnielsen.org/ddi/how-to-crawl-a-quarter-billion-webpages-in-40-hours/
======
soult
In my experience, one of the hardest parts of writing a web crawler is URL
selection:

After crawling your list of seed URLs, where do you go next? How do you make
sure you don't crawl the same content multiple times because it has a slightly
different URL? How do you avoid getting stuck on unimportant spam sites with
autogenerated content?

Because the author only crawled domains from a limited set and only for a
short time, he did not need to care about that part. Nonetheless, it's a great
article that shows many of the pitfalls of writing a webcrawler.

~~~
staunch
Yup. It's very hard to avoid getting stuck in accidental tar pits. One page
with inexhaustible URL variations.

    
    
      http://example.com/article/story?storyID=19039&ref=329932&sessionID=9043275
      http://example.com/article/story?storyID=19039&ref=902932&sessionID=9023409
      http://example.com/article/story?storyID=19039&ref=904354&sessionID=8230235
      ... infinite ...

Canonical URL meta tags can help.

Filtering out certain query parameters (JSESSIONID, PHPSESSID, etc.) can
help.
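
A minimal Python sketch of that kind of filtering (the parameter blacklist is
made up and would need tuning per site):

    from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

    # Example blacklist of session/tracking-style parameters; extend per site.
    STRIP_PARAMS = {"jsessionid", "phpsessid", "sessionid", "sid", "ref"}

    def canonicalize(url):
        parts = urlparse(url)
        # Keep only query parameters that are not session/tracking noise.
        kept = [(k, v) for k, v in parse_qsl(parts.query)
                if k.lower() not in STRIP_PARAMS]
        query = urlencode(sorted(kept))
        return urlunparse(parts._replace(query=query, fragment=""))

    # canonicalize("http://example.com/article/story?storyID=19039&ref=329932&sessionID=9043275")
    # -> "http://example.com/article/story?storyID=19039"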

Sitemaps help.

I'm rather impressed that search engines do it so well. I imagine the right
approach involves examining the contents of the pages and doing checksums, but
I'd love to know what the real search engines do.

I do know that a naive crawler will completely fail to crawl any significant
portion of the real web without solving this. It's quite possible those 250
million "pages" were actually something like 1 million distinct pages.

~~~
hntester123
>I'm rather impressed that search engines do it so well. I imagine the right
approach involves examining the contents of the pages and doing checksums

Interesting point. Q: Is it possible for two different pages to give the same
checksum? (Asking for my own info; I do know what checksums are, in overview,
from way back, but haven't checked :) them out much.

~~~
josephlord
Given two different pages, getting the same checksum should be VERY unlikely.
But getting a collision somewhere among billions of pages is fairly or very
likely, depending on the type of checksum. So it is likely to be important to
only compare within domains, or to divide the space by some other method.

Also, it doesn't help you at all when pages you want to treat as the same
aren't actually identical, because they include the request time or some other
unique element.

Edit: Added sentence about comparing within domains.

~~~
mrkrwtsn
You could use an edit distance algorithm instead of checksums, although that
would be really time intensive.
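
For reference, plain Levenshtein edit distance is O(len(a) * len(b)), which is
why it gets expensive quickly on whole pages; a minimal Python version:

    def edit_distance(a, b):
        # Classic Levenshtein distance, computed one row at a time.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (ca != cb)))    # substitution
            prev = cur
        return prev[-1]

    # edit_distance("kitten", "sitting") == 3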

------
brey
> I used a Bloom filter to keep track of which urls had already been seen and
> added to the url frontier. This enabled a very fast check of whether or not
> a new candidate url should be added to the url frontier, with only a low
> probability of erroneously adding a url that had already been added.

other way round? bloom filter provides a low probability of erroneously
believing a URL had already been added when it had not, zero probability of
believing a URL had not already been added to the filter when in fact it had.

using a bloom filter in this way guarantees you won't ever hit a page twice,
but you'll have a non-zero rate of pages you think you've downloaded but you
actually haven't, depending how you tune it.
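
A toy Python version just to make the asymmetry concrete (a real
implementation would use a proper bit array and better hash mixing; the size
and hash count here are arbitrary):

    import hashlib

    class BloomFilter:
        def __init__(self, size_bits=1 << 24, num_hashes=7):
            self.size = size_bits
            self.k = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, item):
            for i in range(self.k):
                h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
                yield int(h, 16) % self.size

        def add(self, item):
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, item):
            # May return True for items never added (false positive),
            # but never returns False for an item that was added.
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(item))

    urls_seen = BloomFilter()
    urls_seen.add("http://example.com/")
    # "http://example.com/" in urls_seen -> True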

~~~
pooriaazimi
> _but you'll have a non-zero rate of pages you think you've downloaded but
> you actually haven't, depending how you tune it._

My (not very fresh) memory of what a bloom filter actually is tells me that
this "non-zero rate" you're talking about must be HUUUUGE. On the order of
_millions_ of pages. Am I right?

~~~
cypherpunks01
You're right if OP was using a small bloom filter, and wrong if it was a big
one. Hence, the phrase, "depending how you tune it."
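
To put numbers on it, the standard estimate is p ≈ (1 - e^(-kn/m))^k; a quick
sketch in Python (the parameters are hypothetical, not what the article
actually used):

    from math import exp

    def false_positive_rate(n, m_bits, k):
        # Standard Bloom filter estimate: p ~= (1 - e^(-k*n/m))^k
        return (1 - exp(-k * n / m_bits)) ** k

    n = 250_000_000                                        # URLs inserted (hypothetical)
    big   = false_positive_rate(n, 4 * 2**30 * 8, k=11)    # 4 GB filter
    small = false_positive_rate(n, 64 * 2**20 * 8, k=7)    # 64 MB filter
    print(big * n, small * n)                              # expected wrongly-skipped URLs

With the big filter the expected number of wrongly skipped URLs is a tiny
fraction of one; with the small one it's on the order of a couple hundred
million. That's the difference the size makes.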

------
sneak
> Code: Originally I intended to make the crawler code available under an open
> source license at GitHub. However, as I better understood the cost that
> crawlers impose on websites, I began to have reservations. My crawler is
> designed to be polite and impose relatively little burden on any single
> website, but could (like many crawlers) easily be modified by thoughtless or
> malicious people to impose a heavy burden on sites. Because of this I’ve
> decided to postpone (possibly indefinitely) releasing the code.

That is not a good reason. There are many crawlers out there. Anyone can
easily modify the string "robots.txt" in the wget binary to "xobots.txt".

Release your code so that others can learn. Stop worrying that you are giving
some special tool to bad people - you aren't.

~~~
duaneb
Not to mention if you were malicious, the crawler would probably be inferior
to:

    
    
        while :; do curl "http://www.ycombinator.com"; done
    

That said, there are many crawlers out there, many of them probably more
sophisticated in their ability to ruin someone else's day. Unless you're
releasing an exploit, malicious users probably know how to abuse the internet
more than you do.

~~~
TazeTSchnitzel
Or even the infamous LOIC

------
ChuckMcM
This is a great article. On a side note, if you want to do this all day and get
paid for it, let me know :-) Crawls are the first step of a search engine. Greg
Lindahl (CTO at blekko.com) has been writing up a variety of technologies used
in our search engine work at High Scalability [1].

One of the most interesting things for me is that a lot of the 'frothiest' web
pages (those that change every day or several times a day) have become a pretty
significant chunk of the web compared to even 5 years ago. I don't see that
trend abating much.

[1] <http://highscalability.com/>

~~~
mjn
I'm not sure if it's yet a major issue, but I've been noticing an increasing
number of "frothy" sites that are fake-frothy. They change often, but in
trivial ways, I assume to try to make themselves seem fresher for SEO reasons.
If it were possible to consistently detect them, it might be better to treat
them as non-volatile and avoid wasting time re-crawling them.

------
netvarun
Thank you very much for the post. I have written a distributed crawler at my
startup Semantics3* - we track the price and metadata fields from all the
major ecommerce sites.

Our crawler is written in Perl. It uses an evented architecture (written using
the AnyEvent library). We use Redis to store state (which URLs have been
crawled, using a hash) and to determine which URLs to crawl next (using sorted
sets).

Instead of using a bloom filter, we used sorted sets to dedupe URLs and pick
the highest-priority URLs to crawl next (a sort of priority queue).
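
A rough Python sketch of that hash-plus-sorted-set pattern using redis-py (the
key names, priority scheme, and batch size are just illustrative):

    import redis

    r = redis.Redis()

    def enqueue(url, priority):
        # The sorted set dedupes what's queued (NX = only add if not present)
        # and doubles as the priority queue; skip URLs already crawled.
        if not r.hexists("crawled", url):
            r.zadd("frontier", {url: priority}, nx=True)

    def next_batch(n=100):
        # Hand out the highest-priority URLs and mark them as crawled.
        urls = r.zrange("frontier", 0, n - 1, desc=True)
        if urls:
            r.zrem("frontier", *urls)
            r.hset("crawled", mapping={u: 1 for u in urls})
        return [u.decode() for u in urls]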

For the actual distribution of crawling (the 'map reduce' part) we use the
excellent Gearman work distribution server.

One major optimization I can suggest is caching DNS lookups (and doing them
asynchronously). You can save a lot of time and resources, especially at that
scale, by simply caching DNS requests. Another optimization would be to keep
the socket connection open and download all the pages from the same domain
asynchronously.
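
The DNS caching can be as crude as memoizing the resolver; a tiny sketch (a
real setup would respect TTLs and resolve asynchronously):

    import socket
    from functools import lru_cache

    @lru_cache(maxsize=100_000)
    def resolve(host):
        # Cache the A record so repeated requests to the same domain
        # don't pay for a DNS round trip every time.
        return socket.gethostbyname(host)

    # resolve("example.com") hits DNS once; later calls are free.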

*Shameless plug: We just launched our private beta. Please sign up and use our API using this link:

<https://www.semantics3.com/signup?code=ramanujan>

------
jdrock
Being the CEO of a firm that offers web-crawling services, I found this post
very interesting. On 80legs, the cost for a similar crawl would be $500, so
it's nice to know we're competitive on cost.

~~~
neurotech1
Wouldn't the cost of transferring the crawled data (compressed/archived) to
AWS/EC2 partially offset the cost savings?

Edit: AWS still has free inbound BW... my bad

~~~
brey
nice thing about running crawlers from EC2 - all inbound bandwidth is free

~~~
zerop
Like running crawlers, what are some other use cases where free inbound is a
blessing?

~~~
bigiain
Backup/archive, where uploading data is frequent and downloading it is only
very occasionally required.

------
rb2k_
My Master's degree project was a webcrawler. If you're already reading this,
the thesis[0] might be a somewhat entertaining read.

I had somewhat different constraints (only hitting the frontpage,
cms/webserver/... fingerprinting, a backend that has to be able to do ad-hoc
queries for site features), but it's nice to see that the process is always
somewhat the same.

One of the most interesting things I experienced was that link crawling works
pretty well for a certain amount of time, but after you have visited a large
number of pages, bloom filters are pretty much the only way to protect against
duplication in a memory-efficient way.

I switched to a hybrid model where I still check for links, but to limit the
needed depth I switched to using pinboard/twitter/reddit to find new domains.
For bootstrapping you can get your hands on zone files from places on the
internet (e.g. premiumdrops.com), which will pretty quickly keep you from
having to crawl too deep.

These days, I run on a combination of a worker approach with redis as a
queue/cache and straight elasticsearch in the backend. I'm pretty happy with
the easy scalability.

Web crawlers are a great weekend project: they let you fiddle with evented
architectures (github sample [1]), scale a database, and watch the bottlenecks
jump from place to place within your architecture. I can only recommend
writing one :)

[0] <http://blog.marc-seeger.de/2010/12/09/my-thesis-building-blocks-of-a-scalable-webcrawler/>
[1] <https://github.com/rb2k/em-crawler-sample>

------
alexbardas
Not bad at all. Just a few months ago I built a crawler using NodeJS (not
publicly released, though I plan to) to take advantage of its evented
architecture. I managed to crawl and store (in mongo) more than 300k movies
from IMDB in just a few hours, using only a laptop and 8 processes, each with a
specified number of concurrent connections (it was based on the nodejs cluster
and kue libs by learnboost). For html parsing I used jsdom or cheerio (faster
but incomplete), and the process of extracting and storing the data was very
fast (probably less than 10 ms per page). Kue is similar to ruby's resque or
python's pyres, so the advantage was that every request was basically an
independent job, using redis as a pubsub.

Even though your implementation is a lot more complex and very well documented,
IMO using non-blocking I/O is a much better solution, because crawling is very
I/O intensive and most of the time is spent waiting on the connection (request
+ response time). Using that many machines and processes, the time should be
much shorter with node.
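
For example, the same non-blocking pattern in Python (the article's language)
looks roughly like this with aiohttp; the URL list and concurrency cap are
just placeholders:

    import asyncio
    import aiohttp

    async def fetch(session, sem, url):
        # The semaphore caps in-flight requests; everything else overlaps.
        async with sem, session.get(url) as resp:
            return url, await resp.text()

    async def crawl(urls, concurrency=100):
        sem = asyncio.Semaphore(concurrency)
        async with aiohttp.ClientSession() as session:
            return await asyncio.gather(
                *(fetch(session, sem, u) for u in urls))

    # pages = asyncio.run(crawl(["http://example.com/"] * 10))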

~~~
yorhel
> I managed to crawl [..] more than 300k movies from IMDB in just a few hours

I suppose IMDB already has a pretty good architecture to handle that load, but
please, if you're crawling from a single site, _be careful_. I host a similar
database myself, and the CPU/load graphs of my server can tell me exactly when
someone has a crawler active again. That's not fun if your goal is to keep a
site responsive while keeping the hosting at low cost.

~~~
alexbardas
Very true indeed. I was also randomly changing user agents (Mozilla, Safari,
Chrome, IE). I thought this would make it harder to tell whether there was a
lot of traffic from the same network or whether someone was just intensively
crawling the site.

For me, it was more a proof of how efficient and fast a crawler can be. Also,
responses from IMDB were very fast (less than 0.4 seconds), so not that much
time was lost there.

~~~
binarysolo
Gray hat question out of curiosity and possible experience: did you also use
proxies or perhaps even Tor?

------
veneratio
Wow, that was informative. I appreciated the author's responsibility the most.
Rather than make this a daring adventure or fanciful notion, Nielsen
approached the activity with a genuine interest in creating something awesome,
not just from the angle of power. Great post.

------
x-header
This is not how to crawl webpages. He started with the Alexa list. Those are
not necessarily domain names of servers serving webpages. I would guess that
some of the requests to cease crawling came from some of these listings.
Working from the Alexa list, he would have been crawling some of the darkest
underbelly of the web: ad servers and bulk email services.

His question: "Who gets to crawl the web?" is an interesting one though.

Do not assume that Googlebot is a smart crawler. Or smarter than all others.
The author of Linkers and Loaders posted recently on CircleID about how dumb
Googlebot can be.

There is no such thing as a smart crawler. All crawlers are stupid. Googlebot
resorts to brute force more often than not.

Theoretically no one should have to crawl the web. The information should be
organised when it is entered into the index.

Do you have to "crawl" the Yellow Pages? Are listings arranged by an
"algorithm"? PageRank? 80/20 rules?

Nothing wrong with those metrics; except of course that they can be gamed
trivially, as experiments with Google Scholar have shown. But building a
business around this type of ranking? C'mon.

If the telephone directories abandoned alpha and subject organisation for
"popularity" as a means of organisation it would be total chaos. Which is why
"organising the world's information" is an amusing mission statement when your
entire business is built around enabling continued chaos and promoting
competition for ranking.

Even worse are companies like Yelp. It's blackmail.

If the information was organised, e.g., alphabetically and regionally, it
would be a lot easier to find stuff. Instead, search engines need to spy on
users to figure out what they should be letting users choose for themselves.
Where "user interfaces" are concerned, it is a fine line between "intuitive"
and "manipulative".

The people who run search engines and directory sites are not objective. They
can be bought. They want to be bought.

This brings quality down. As it always has for traditional media as well. But
it's much worse with search engines.

~~~
JunkDNA
>Theoretically no one should have to crawl the web. The information should be
organised when it is entered into the index.

What do you mean by this statement? I can see from your post you have a
contrarian view on how to organize and find stuff on the web, but I'm
struggling to understand what alternative you're proposing.

~~~
eugenes
Well, I'm not him, but: probably starting from a zone file and narrowing it
down to only whitelisted and "legit" domains would be a good start.

Maybe during the registration process more metadata should be demanded of
people, and anonymity prohibited or reduced. That way, if you wanted a list of
all the .com blogs, it would be just a grep away and tied to mostly real
people. Corporate websites would be tied to their business entity with an EIN
or something and verified, etc.

The thing is, that ship sailed a long time ago, so we are stuck with Google.

------
mey
Is there any reason you crawled for 40 hours? Was this the optimal cost from
Amazon? Why not do this over more time by bidding for spot instances?

~~~
waterlesscloud
He mentions spot instances near the end.

tl;dr - He didn't think about it until just before launching the experiment
and worried that it would take too much time to understand the implications of
changing his approach, though he estimates it might represent a factor-of-five
decrease in price.

------
zaptheimpaler
Excellent article. Looks like there was a lot of work involved in the project;
about how long did it take you to make this?

------
zerop
Thanks. I came across Fabric from this article. Very useful for us.

------
praveenhm
Very nice article on web crawling.

------
marklit
You can make Fabric execute commands in parallel. The reliability will be
about as good as it gets with Chef; I've spent ages dealing with edge cases in
both Fabric and Chef setup systems.

<http://morgangoose.com/blog/2010/10/08/parallel-execution-with-fabric/>
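
For reference, with later Fabric 1.x releases the parallel part boils down to
decorating tasks; a small sketch (hosts and command are placeholders):

    from fabric.api import env, parallel, run

    # Hypothetical worker hosts.
    env.hosts = ["crawler1.example.com", "crawler2.example.com"]

    @parallel
    def check_load():
        # Runs on all hosts at the same time instead of one after another.
        run("uptime")

    # Invoke with: fab check_load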

