
How to crawl a quarter billion webpages in 40 hours (2012) - allenleein
http://www.michaelnielsen.org/ddi/how-to-crawl-a-quarter-billion-webpages-in-40-hours/
======
mindcrime
_Originally I intended to make the crawler code available under an open source
license at GitHub. However, as I better understood the cost that crawlers
impose on websites, I began to have reservations. My crawler is designed to be
polite and impose relatively little burden on any single website, but could
(like many crawlers) easily be modified by thoughtless or malicious people to
impose a heavy burden on sites. Because of this I’ve decided to postpone
(possibly indefinitely) releasing the code._

Given that there are plenty of existing, open-source crawling engines out
there, I don't see how this decision is really accomplishing anything.
Concretely, Apache Nutch[1] can crawl at "web scale" and is apparently the
crawler used by Common Crawl.

 _There’s a more general issue here, which is this: who gets to crawl the
web?_

This, to me, is the most interesting issue raised by this article. In
principle, there's no particular reason that, say, Google, has to dominate
search. If somebody clever comes up with a better ranking algorithm, or some
other cool innovation, they should be able to knock Google off their perch the
same way Google displaced Altavista. BUT... that's only true if anybody can
crawl the web in the first place... OR something like Common Crawl reaches
parity with the Googles of the world, in both volume and frequency of crawled
data.

The first scenario is definitely questionable. Sure, you can plug the
Googlebot user agent string into your crawler, but plenty of sites are smart
enough to look at other factors and will reject your requests anyway. (I know,
I used to work for a company that specialized in blocking bots, crawlers,
etc.)

It really is a bit of a catch-22. Site owners legitimately want to keep bad
crawlers/bots from (a) consuming excessive resources and (b) stealing content
from their sites. But too much of this will lock us into a search oligopoly
that isn't good for anybody (except maybe Google shareholders).

[1]:
[https://en.wikipedia.org/wiki/Apache_Nutch](https://en.wikipedia.org/wiki/Apache_Nutch)

~~~
danielrhodes
This seems like an oversimplification of Google’s value prop. Crawling the web
is a somewhat trivial problem these days - ranking pages, removing spam,
personalizing it to the viewer, and doing so in a matter of milliseconds from
wherever you are in the world is a far greater problem and competitive
advantage.

~~~
mindcrime
_This seems like an oversimplification of Google’s value prop._

I didn't say anything about Google's value prop. My point is, _whatever_ their
value prop is, it's built on top of their ability to crawl the web at mass
scale, and very quickly. So anybody who wants to compete with Google, by being
better at "ranking pages, removing spam, personalizing it to the viewer," or
whatever, will need to be able to crawl in a similar manner. IOW, crawling
is part of the "price of admission".

~~~
petters
> IOW, crawling is part of the "price of admission".

Yes, but I think the comment you responded to was saying that it is an
insignificant part these days.

~~~
mindcrime
_Yes, but I think the comment you responded to was saying that it is an
insignificant part these days._

Maybe so. In which case, I'd say I disagree. Great crawling ability is
definitely a necessary, but not sufficient, condition for building a
competitive search engine. And while the technical aspects of building a large
scale search engine have been at least partly trivialized by OSS crawling
software, elastic computing resources in the cloud, etc., what is at issue is
the possibility of site owners _blocking anybody who isn't
(Google|Bing|Baidu|etc)_.

In this context, that's my concern: being blocked from crawling, if you're not
already on the "allowed" list. Hence my reference to the question quoted
above, from TFA.

~~~
kbenson
I think the terminology you are looking for (or at least that you could use
that might trigger people to accurately infer what you are trying to express)
is that search and crawling are the _foundation_ of everything Google has
built, in multiple aspects.

In one aspect, it was literally their beginning, from which they were able to
build a business and expand.

In another, much more on-point aspect, it underlies the majority of their
services, either directly or a few steps removed.

Like the foundation of a house, it may not always be the most visible aspect,
and it may be taken for granted, but its contribution to the integrity of the
whole can't be overstated.

------
dstick
For those curious like I was how much it would cost to scrape the entire
internet with the method and numbers provided:

250,000,000 pages come in at $580

There are 1.8b websites according to [http://www.internetlivestats.com/total-
number-of-websites/](http://www.internetlivestats.com/total-number-of-
websites/)

Let's say on average each site has 10 pages (you have a dozen huge blogs vs.
tens of thousands of one-pagers); that would put the number at 18 billion
pages.

Following that logic would mean the total web is 72 times larger than what was
scraped in this test.

So for a mere $41,760 you too can bootstrap your own Google! ;-)
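
To make the extrapolation explicit, a quick sketch (the 10-pages-per-site
average is of course just a guess):

    # back-of-the-envelope extrapolation from the numbers above;
    # pages_per_site is a pure assumption
    crawled_pages = 250_000_000
    crawl_cost_usd = 580
    total_sites = 1_800_000_000
    pages_per_site = 10                            # assumed average
    total_pages = total_sites * pages_per_site     # 18,000,000,000
    scale = total_pages / crawled_pages            # 72.0
    print(scale, scale * crawl_cost_usd)           # 72.0 41760.0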

~~~
101km
Your link also says:

> It must be noted that around 75% of websites today are not active, but
> parked domains or similar.

So actually more like 0.5B websites. Feels quite tiny. Seems most activity
online really is behind walled gardens like FB.

~~~
Jacq5
You can check abstract statistics here:
[http://www.businessinsider.com/sandvine-bandwidth-data-
shows...](http://www.businessinsider.com/sandvine-bandwidth-data-shows-70-of-
internet-traffic-is-video-and-music-streaming-2015-12)

So the majority of bandwidth is video. But as for unique users, social media
and Google do make up the majority.

------
known
On your laptop:

Download the latest list of URLs from [https://www.verisign.com/en_US/channel-
resources/domain-regi...](https://www.verisign.com/en_US/channel-
resources/domain-registry-products/zone-file/index.xhtml)

    
    
  # Needs GNU parallel and gawk (-O, RT and IGNORECASE are gawk extensions); fetches each URL and prints its <title>
  tail -n+149778267 urls.txt | parallel -I@ -j4 -k sh -c "echo @;curl -m10 --compressed -L -so - @ | awk -O -v IGNORECASE=1 -v RS='</title' 'RT{gsub(/.*<title[^>]*>/,\"\"); {print;exit;} }'; echo @;" >> titles.txt;

~~~
codetrotter
Why do you skip so many lines of the file?

------
frogfish
Previous discussions: [1] with 67 comments, [2] with 23 comments.

[1]
[https://news.ycombinator.com/item?id=4367933](https://news.ycombinator.com/item?id=4367933)
[2]
[https://news.ycombinator.com/item?id=10865568](https://news.ycombinator.com/item?id=10865568)

------
t0mbstone
Something that has always puzzled me about scrapers is how they avoid scraper
traps.

For example, what if you have a web site that generates a thousand random
links on a page, which all load pages that generate another thousand random
links, to infinity?

~~~
iso1337
You can use the AOPIC algorithm, which has a credit and taxation system to
penalize these types of websites.

[https://stackoverflow.com/questions/5834808/designing-a-
web-...](https://stackoverflow.com/questions/5834808/designing-a-web-crawler)
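
Roughly, the credit/tax idea looks something like the sketch below (Python; a
rough paraphrase of the scheme described in that answer, not the exact AOPIC
formulation, and fetch_outlinks is a hypothetical stand-in for your real
fetcher/parser):

    # Every page holds credit; crawling a page taxes part of that credit into
    # a virtual "Lambda" pot and splits the rest among its outlinks. Trap pages
    # that only link to each other keep bleeding credit to the tax, so they
    # sink down the priority order.
    TAX_RATE = 0.10
    
    def crawl(seed_urls, fetch_outlinks, max_pages=10_000):
        credit = {url: 100.0 / len(seed_urls) for url in seed_urls}
        lambda_credit = 0.0
        crawled = set()
        for _ in range(max_pages):
            frontier = {u: c for u, c in credit.items() if u not in crawled}
            if not frontier:
                break
            url = max(frontier, key=frontier.get)      # most credit first
            page_credit = credit.pop(url)
            crawled.add(url)
            lambda_credit += page_credit * TAX_RATE    # tax
            remainder = page_credit * (1 - TAX_RATE)
            outlinks = [u for u in fetch_outlinks(url) if u not in crawled]
            if outlinks:
                for link in outlinks:
                    credit[link] = credit.get(link, 0.0) + remainder / len(outlinks)
            else:
                lambda_credit += remainder             # dead end
            # redistribute the Lambda pot now and then so weakly linked
            # pages still get visited eventually
            if lambda_credit > 10.0 and credit:
                share = lambda_credit / len(credit)
                credit = {u: c + share for u, c in credit.items()}
                lambda_credit = 0.0
        return crawled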

------
yodsanklai
Two very basic questions.

Something I don't understand: how can a webpage "block" a request? Is there a
way, from a basic HTTP GET request, to tell if it has been issued by a browser
or something else?

Lots of webpages are generated dynamically. Consider something like
[https://news.ycombinator.com/item?id=17462089](https://news.ycombinator.com/item?id=17462089):
does the crawler follow URLs with parameters? What if the parameters
identifying the page are passed in the HTTP request?

~~~
mindcrime
_Something I don't understand: how can a webpage "block" a request? Is there a
way, from a basic HTTP GET request, to tell if it has been issued by a browser
or something else?_

Sure. In the simplest case you just look at the User-Agent header. Your
browser will send one thing, and something like GoogleBot will send something
else. Now, if you're a website owner who wants to block bots, you can't just
depend on that, because somebody writing a bot can trivially put any string in
that header that they want. But there's a lot of other ways you can tell. A
simple way to discriminate somebody who's just using curl or wget, for
example, is to serve a page with some javascript in it and check if the
javascript is executed or not. Usually you'd do something like this from a
proxy that sits in front of your actual content, and throw out subsequent
requests from a UA that fails the check. Of course identifying the UA
consistently is yet another challenge... if the thing handles cookies properly,
you can use a cookie. You could try going by IP, but that's dicey in various
ways. Etc., etc., yada, yada.
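
Just to illustrate the UA part (a toy sketch, not how any real bot-blocking
product actually works; the substring list is made up):

    # naive User-Agent filtering; trivially defeated by spoofing the header,
    # which is exactly why the JS-execution / cookie / IP checks exist
    BLOCKED_UA_SUBSTRINGS = ["curl", "wget", "python-requests", "scrapy"]
    
    def should_block(headers: dict) -> bool:
        ua = headers.get("User-Agent", "").lower()
        if not ua:
            return True                  # no UA at all is suspicious
        return any(s in ua for s in BLOCKED_UA_SUBSTRINGS)
    
    # should_block({"User-Agent": "curl/7.64.1"})      -> True
    # should_block({"User-Agent": "Mozilla/5.0 ..."})  -> False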

All in all, there's a constant arms race going on between the companies that
want to block bots / crawlers, and the people who want to crawl/scrape
content. The techniques on both sides are constantly evolving.

------
mgamache
No one has brought it up, so I will: Google has an algorithm that measures the
responsiveness of a site to adjust crawl speed. Google will adjust the crawl
rate so it won't negatively impact performance. It will also crawl your site's
pages more based on updates to content and popularity of URLs. I've written
crawlers; it's not as trivial as some would like to believe, but analyzing the
content to provide relevant search results is 99.99% of the effort of the
Google/Bing teams.
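
Nobody outside Google knows their actual scheduler, but the responsiveness
feedback can be sketched as a simple per-host backoff loop (stdlib Python;
the thresholds and factors are made up):

    import time
    import urllib.request
    
    # back off when responses get slow or fail, ease up when they are fast
    def polite_fetch(urls, min_delay=1.0, max_delay=60.0):
        delay = min_delay
        for url in urls:
            start = time.time()
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    body = resp.read()
                slow = (time.time() - start) > 2.0
                delay = min(max_delay, delay * 2) if slow else max(min_delay, delay * 0.8)
                yield url, body
            except Exception:
                delay = min(max_delay, delay * 2)   # errors also trigger backoff
            time.sleep(delay)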

------
ospider
250000000 / (3600 * 40) / 20 = 86

86 pages per machine is not very performant at all; just very simple
parallelism will do.

~~~
natdempk
The units of your equation are pages per second per machine, but I agree.
Reading briefly it seems like he only used 141 threads per machine to do the
actual crawling. This likely could be pushed an order of magnitude further (or
even more!), especially by using green threads. It does seem like he was
running up against CPU constraints and network issues soon after that, but
also this was written in 2012.
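
For what it's worth, the blocking-I/O version of "lots of threads" is only a
few lines with the standard library (a sketch; past a certain point you would
want an async or green-thread model instead):

    import urllib.request
    from concurrent.futures import ThreadPoolExecutor
    
    # workers spend nearly all their time blocked on network I/O,
    # so the thread count can go well past a few hundred per machine
    def fetch(url):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return url, resp.status, len(resp.read())
        except Exception as e:
            return url, None, str(e)
    
    def crawl_many(urls, workers=1000):
        with ThreadPoolExecutor(max_workers=workers) as pool:
            yield from pool.map(fetch, urls)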

~~~
vidarh
Ca. 2007 I was running an RSS feed fetcher that ran 400 processes (not
threads) in parallel on dual-CPU Xeons. The only reason we didn't push it
higher was that 400 in parallel was more than enough for our use at the time.
Of course, of those 400, some were always waiting on IO. I never measured how
many feeds we did on average every second.

------
anonu
I've used Common Crawl ([http://commoncrawl.org/](http://commoncrawl.org/)),
also referenced in the article. This is great work and I would love it if they
could get a daily crawl going.

------
Exuma
Did you use a cuckoo filter instead of a bloom filter to manage dynamic
additions?

------
bootcat
This is an amazing read. But I think, for people to implement, it should be how
to crawl a billion pages on cheaper boxes, maybe spanning a week or something.

------
wiennat
This is a post from 2012.

------
izzydata
250 million is fewer characters than quarter billion.

~~~
sbinthree
250 million isn't cool.

