How to crawl a quarter billion webpages in 40 hours (2012) (michaelnielsen.org)
296 points by allenleein 8 months ago | 61 comments

> Originally I intended to make the crawler code available under an open source license at GitHub. However, as I better understood the cost that crawlers impose on websites, I began to have reservations. My crawler is designed to be polite and impose relatively little burden on any single website, but could (like many crawlers) easily be modified by thoughtless or malicious people to impose a heavy burden on sites. Because of this I’ve decided to postpone (possibly indefinitely) releasing the code.

Given that there are plenty of existing, open-source crawling engines out there, I don't see how this decision is really accomplishing anything. Concretely, Apache Nutch[1] can crawl at "web scale" and is apparently the crawler used by Common Crawl.

> There’s a more general issue here, which is this: who gets to crawl the web?

This, to me, is the most interesting issue raised by this article. In principle, there's no particular reason that, say, Google has to dominate search. If somebody clever comes up with a better ranking algorithm, or some other cool innovation, they should be able to knock Google off their perch the same way Google displaced Altavista. BUT... that's only true if anybody can crawl the web in the first place... OR something like Common Crawl reaches parity with the Googles of the world, in both volume and frequency of crawled data.

The first scenario is definitely questionable. Sure, you can plug the Googlebot user agent string into your crawler, but plenty of sites are smart enough to look at other factors and will reject your requests anyway. (I know, I used to work for a company that specialized in blocking bots, crawlers, etc.)

It really is a bit of a catch-22. Site owners legitimately want to keep bad crawlers/bots from A. consuming excessive resources, and B. stealing content, from their sites. But too much of this will lock us into a search oligopoly that isn't good for anybody (except maybe Google shareholders).

[1]: https://en.wikipedia.org/wiki/Apache_Nutch

This seems like an oversimplification of Google’s value prop. Crawling the web is a somewhat trivial problem these days - ranking pages, removing spam, personalizing it to the viewer, and doing so in a matter of milliseconds from wherever you are in the world is a far greater problem and competitive advantage.

Pretty sure this is an oversimplification of Google’s value prop.

I didn't say anything about Google's value prop. My point is, whatever their value prop is, it's built on top of their ability to crawl the web at mass scale, and very quickly. So anybody who wants to compete with Google, by being better at "ranking pages, removing spam, personalizing it to the viewer," or whatever, will need to be able to crawl in a similar manner. IOW, crawling is part of the "price of admission".

> IOW, crawling is part of the "price of admission".

Yes, but I think the comment you responded to was saying that it is an insignificant part these days.

> Yes, but I think the comment you responded to was saying that it is an insignificant part these days.

Maybe so. In which case, I'd say I disagree. Great crawling ability is definitely a necessary, but not sufficient, condition for building a competitive search engine. And while the technical aspects of building a large scale search engine have been at least partly trivialized by OSS crawling software, elastic computing resources in the cloud, etc., what is at issue is the possibility of site owners blocking anybody who isn't (Google|Bing|Baidu|etc).

In this context, that's my concern: being blocked from crawling, if you're not already on the "allowed" list. Hence my reference to the question quoted above, from TFA.

I think the terminology you are looking for (or at least that you could use that might trigger people to accurately infer what you are trying to express), is that search and crawling are the foundation of everything google has built, in multiple aspects.

In one aspect, it was literally their beginning, from which they were able to build a business and expand.

In another, much more on point aspect, it underlies the majority of their services, either directly or a few steps removed.

Like the foundation of a house, it may not always be the most visible aspect, and it may be taken for granted, but its contribution to the integrity of the whole can't be overestimated.

Web crawling is necessary, but not sufficient. You won't be able to build a better Google just because you can crawl the web, but you won't be able to build a better Google if you can't crawl the web.

...which makes about as much sense as saying breathing is the least significant thing a human does to someone bemoaning a lack of oxygen.

You fail to understand that crawling the web is 0.05% of the picture. Google has the best treatment of that data to serve ads and personalise experience, and that's what their value is. You think making a big crawler would nullify google, but you'd have to do the other 99.95% to be better (which costs money most people don't have).

> You think making a big crawler would nullify google

No, I don't. I didn't say anything even remotely like that. What I'm saying is that crawling is a prerequisite for all of the "... best treatment of that data to serve ads, personalise experience ..." stuff. That's why this topic matters: because if people aren't allowed to crawl (or can't do it effectively for technical reasons, but that's a subtly different issue), then they can't do anything else.

Just to be clear, I am most emphatically not saying that just building a crawler is enough to compete with Google. What I'm saying is that you can't build a competitive search engine without that "0.05%" bit because it's required to enable the other 99.95%.

Why does everything have to have a "value proposition". Quite narrow minded.

Crawl, but share.

Another possible approach is for sites to be self-indexing, with sharable indices, and validation (and penalties) for deceptive practices (either false term inclusion or exclusion).

The simple existence of a websearch and index protocol would wipe virtually all of Google's present value.

Or possibly a common open resource from which services can pull regularly updated crawl data. I'm not sure whether what I'm thinking of exists or is feasible; it's just a thought from someone with little experience in crawling. That would put the load on a single source, though, and someone has to pay for those resources. Maybe a distributed resource would help; I wonder if blockchain technology or something similar could factor into such a niche.

Right: that's "crawl, but share".

Both models are fairly isomorphic, modulo the crawl / search locus. The key is on agreed data structures, practices, and access.

That sounds a lot like what CommonCrawl does, and they are a good start. But from what I've seen, if you wanted to build a really competitive search engine, you'd still have to complement that with your own crawling. Not sure though, as I haven't done a deep dive into this topic.

>but could (like many crawlers) easily be modified by thoughtless or malicious people to impose a heavy burden on sites. Because of this I’ve decided to postpone (possibly indefinitely) releasing the code.

I came here to state just this. What would code for the crawler in the article do that premade DDoS bots don't already do? (already does that task better too) Even in 2012 when the article was published, there were plenty of open source crawlers AND tools designed specifically for "burdening websites with traffic" that were available.

I don't see any issue at all if the bot is respecting the robots.txt file. Any malicious user will figure some other evil way anyways, be it Nutch or a network of intelligent lightbulbs.

so you will actually respect:

    User-agent: google
    Allow: /
    User-agent: *
    Disallow: /

Well, yes, it's good behavior to actually respect it, but I've already seen robots.txt files like this, and they make it really painful to create a competing search engine.
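For anyone who wants to see what "respecting" that file means in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules are the ones quoted above; the agent name `botx` and the `example.com` URL are placeholders I made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules quoted in the comment above.
ROBOTS_TXT = """\
User-agent: google
Allow: /
User-agent: *
Disallow: /
"""


def allowed(agent: str, url: str = "https://example.com/page") -> bool:
    """Return True if `agent` may fetch `url` under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # "botx" is a hypothetical crawler name, not a real bot.
    for agent in ("google", "botx"):
        print(agent, allowed(agent))
```

A crawler that honors this file and does not identify itself as the allowed agent gets nothing at all from the site, which is the pain point being described.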

This has me ridiculously curious now. Is that common? Other than a random sampling of sites I go to, is there a good way to get numbers on how often this is used?

Edit: In my scanning, I have to confess that wikipedia's robot file is the best. Fairly heavily commented on why the rules are there. https://en.wikipedia.org/robots.txt

I analyzed the top 1 million robots.txt files looking for sites that allow google and block everyone else here: https://www.benfrederickson.com/robots-txt-analysis/ - it's a relatively common pattern for major websites

I did run a Yacy web crawler (P2P websearch https://yacy.net) a while ago. As far as I remember, I only saw Yandex disallowed in robots.txt a few times when I had trouble crawling a site. Mostly my Yacy crawler just got an empty page instead of the "real" website.

just do like browsers did with user agent strings. call your bot "botx (google crawler compatible)" and crawl everything that allows Google bot without any weight on your conscience.

Everyone gets to crawl. The net should have no restrictions, period

Does that lack of restrictions include allowing me to decide who has access to things I make available?

Nothing stops you from not serving those who you don't want to serve.

Crawling can be quite debilitating to sites.

Building a better search engine than Google is doable. Maybe it is not even very expensive - something in the order of tens/hundreds of millions might be enough.

The only problem is: nobody would care, people will still use Google because it's all they know. Some people even ignore the fact that underneath they use an operating system and a browser.

> because it's all they know.

Don't you think that's what they used to say about Altavista?

For those curious like I was how much it would cost to scrape the entire internet with the method and numbers provided:

250.000.000 pages come in at $580

There are 1.8b websites according to http://www.internetlivestats.com/total-number-of-websites/

Let's say on average each site has 10 pages (you have a dozen of huge blogs vs tens of thousands of onepagers), that would put the number at 18 billion pages.

Following that logic would mean the total web is 72 times larger than what was scraped in this test.

So for a mere $41.760 you too can bootstrap your own Google! ;-)
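The back-of-the-envelope math above, written out as a small Python sketch. All inputs are the figures from this comment; the 10 pages per site is the commenter's guess, not a measured value.

```python
def estimate_crawl_cost(
    pages_crawled: int = 250_000_000,   # pages in the article's crawl
    cost_usd: float = 580.0,            # cost quoted for that crawl
    websites: int = 1_800_000_000,      # site count from internetlivestats
    pages_per_site: int = 10,           # assumed average, a guess
) -> tuple[float, float]:
    """Return (scale factor, estimated cost in USD) for a full crawl."""
    total_pages = websites * pages_per_site
    scale = total_pages / pages_crawled
    return scale, scale * cost_usd


if __name__ == "__main__":
    scale, cost = estimate_crawl_cost()
    print(f"{scale:.0f}x larger, about ${cost:,.0f}")
```

This only scales the fetching cost linearly; it says nothing about storage, indexing, or serving queries.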

I'm chiming in here since my employer has a few web archives from the IA and some other organizations.

That 10x average seems to be a bit off considering our data, which is of course spotty since it's crawled by a third party.

But to give some numbers, in one of our experiments we filtered web sites from the archive for known entities and got 307,426,990 unique URLs that contained at least two of those entities (625,830,566 non unique) and in there were only 5,331,272 unique hosts. That archive contains roughly 3 billion crawled files (containing not only HTML, but also other MIME types) and covers mostly the German web over a few years.

There are a lot of hosts that have millions of pages. To name a few: Amazon, Wordpress, Ebay, all kinds of forums, banks even. For instance, www.postbank.de has over a million pages and they were not re-crawled nearly that often.

OT but why does postbank.de have over a million pages? Is there much|anything worth crawling there?

We never checked the details there. You can use https://web.archive.org/details/www.postbank.de and https://web.archive.org/web/*/postbank.de/* if you want to go exploring.

I assume a lot will be incorrect links and automatically generated pages as is often the case.

Your link also says:

> It must be noted that around 75% of websites today are not active, but parked domains or similar.

So actually more like 0.5B websites. Feels quite tiny. Seems most activity online really is behind walled gardens like FB.

You can check abstract statistics here: http://www.businessinsider.com/sandvine-bandwidth-data-shows...

So the majority of bandwidth will be videos. But as for unique users, social media and Google do make up the majority.

I get what you're saying about it feeling quite tiny; however, look at it this way: that's like one active website for every 15 people on the planet!

D'oh you're right, should've caught that :-)

Crawling may be cheap, but you also want to save that data and make it queryable without waiting minutes for the response to a query. That makes it way more expensive.

> dozen of huge blogs vs tens of thousands of onepagers

Try crawling ONE wordpress blog with less than 10 posts and you will be surprised just how many pages there are due to pagination on different filters and sorting options, feeds, media pages, etc.

> So for a mere $41.760 you too can bootstrap your own Google! ;-)

I think the cost of fetching and parsing the data is much less than the cost of building an index and an engine to execute queries against that massive index.

Haha I love the math and making it concrete :)

Yeah, the guy ripped himself off. You can crawl it yourself for next to nothing from home. I think everyone's written their own crawler at some point, it's literally web 101.

On your laptop

Download the latest list of urls from https://www.verisign.com/en_US/channel-resources/domain-regi...

  tail -n+149778267 urls.txt | parallel -I@ -j4 -k sh -c "echo @;curl -m10 --compressed -L -so - @ | awk -O -v IGNORECASE=1 -v RS='</title' 'RT{gsub(/.*<title[^>]*>/,\"\"); {print;exit;} }'; echo @;" >> titles.txt;

Why do you skip so many lines of the file?

Previous discussions: with [1] 67 comments, [2] 23 comments.

[1] https://news.ycombinator.com/item?id=4367933 [2] https://news.ycombinator.com/item?id=10865568

Something that has always puzzled me about scrapers is how they avoid scraper traps.

For example, what if you have a web site that generates a thousand random links on a page, which all load pages that generate another thousand random links, to infinity?

You can use the AOPIC algorithm that has a credit and taxation system to penalize these types of websites.


I used to have what was basically a spider trap on my website while learning Django. The truth is that every bot that found it hammered my site with tens of thousands of requests per day (even Googlebot). I don't think it stopped until I blocked the page in robots.txt

A well-behaved crawler will be self-limiting in this regard, in that it seeks to avoid requesting URLs from the same domain too often. The better you are at spreading requests over as many domains as possible, the less you'll be affected by such traps.

You can do a lot of other things too, but the above already takes most of the sting out of such traps unless you use a huge number of domains.
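A rough sketch of that self-limiting idea in Python: keep one queue per host and refuse to hand out a URL for a host that was hit too recently. The class name `PoliteFrontier` and the 60 second delay are my own assumptions for illustration, not something from the article.

```python
from collections import deque
from urllib.parse import urlsplit


class PoliteFrontier:
    """URL frontier that enforces a minimum delay between hits per host."""

    def __init__(self, delay_seconds: float = 60.0):
        self.delay = delay_seconds
        self.queues = {}    # host -> deque of pending URLs
        self.next_ok = {}   # host -> earliest time of the next request

    def add(self, url: str) -> None:
        host = urlsplit(url).hostname or ""
        self.queues.setdefault(host, deque()).append(url)

    def pop(self, now: float):
        """Return a URL whose host may be fetched at time `now`, else None."""
        for host, queue in self.queues.items():
            if queue and self.next_ok.get(host, 0.0) <= now:
                self.next_ok[host] = now + self.delay
                return queue.popleft()
        return None


if __name__ == "__main__":
    frontier = PoliteFrontier()
    for u in ("http://trap.example/1", "http://trap.example/2",
              "http://other.example/a"):
        frontier.add(u)
    print(frontier.pop(0.0), frontier.pop(0.0), frontier.pop(0.0))
```

However many links a trap generates, it can only cost one request per delay interval, so the rest of the crawl keeps moving.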

Just have a maximum traversal depth
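A depth limit is easy to sketch as a breadth-first traversal. In this Python sketch, `get_links` is a stand-in for "fetch the page and extract its links", and the trap function below is a made-up example of a site that never runs out of links.

```python
from collections import deque


def crawl(start, get_links, max_depth):
    """Breadth-first crawl that never follows links past `max_depth`."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # do not expand pages at the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


def trap_links(url):
    """Hypothetical spider trap: every page links to two new pages."""
    return [url + "/a", url + "/b"]


if __name__ == "__main__":
    print(len(crawl("http://trap.example", trap_links, max_depth=2)))
```

The limit bounds the damage but not by much if each page has a thousand links, so it probably works best combined with per-domain limits.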

No one brought it up, so I will: Google has an algorithm which measures the responsiveness of a site to adjust crawl speed. Google will adjust the crawl rate so it won't negatively impact performance. It will also crawl your site's pages more based on updates to content and popularity of URLs. I've written crawlers; it's not as trivial as some would like to believe, but analyzing the content to provide relevant search results is 99.99% of the effort of the Google/Bing teams.

Two very basic questions.

Something I don't understand: how can a webpage "block" a request? Is there a way to tell from a basic HTTP GET request whether it has been issued by a browser or something else?

Lots of webpages are generated dynamically. Consider something like "https://news.ycombinator.com/item?id=17462089". Does the crawler follow URLs with parameters? What if the parameters identifying the page are passed in the HTTP request?

> Something I don't understand: how can a webpage "block" a request? Is there a way to tell from a basic HTTP GET request whether it has been issued by a browser or something else?

Sure. In the simplest case you just look at the User-Agent header. Your browser will send one thing, and something like GoogleBot will send something else. Now, if you're a website owner who wants to block bots, you can't just depend on that, because somebody writing a bot can trivially put any string in that header that they want. But there are a lot of other ways you can tell. A simple way to discriminate somebody who's just using curl or wget, for example, is to serve a page with some javascript in it and check if the javascript is executed or not. Usually you'd do something like this from a proxy that sits in front of your actual content, and throw out subsequent requests from a UA that fails the check. Of course, identifying the UA consistently is yet another challenge: if the thing handles cookies properly, you can use a cookie. You could try going by IP, but that's dicey in various ways. Etc., etc., yada, yada.

All in all, there's a constant arms race going on between the companies that want to block bots / crawlers, and the people who want to crawl/scrape content. The techniques on both sides are constantly evolving.
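To make the simplest case concrete, here is a small Python sketch of a naive User-Agent check and of how trivially a client gets around it. The marker list and the `looks_like_bot` helper are hypothetical; real bot detection relies on the other signals described above.

```python
from urllib.request import Request

# Hypothetical substrings a naive filter might look for.
KNOWN_BOT_MARKERS = ("bot", "crawler", "spider", "curl", "wget")


def looks_like_bot(user_agent: str) -> bool:
    """Naive server-side check based only on the User-Agent header."""
    ua = user_agent.lower()
    return any(marker in ua for marker in KNOWN_BOT_MARKERS)


# A client can put any string it likes in the header. No request is sent here.
request = Request(
    "https://example.com/",
    headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"},
)
spoofed_agent = request.get_header("User-agent")

if __name__ == "__main__":
    print(looks_like_bot("curl/7.64.0"), looks_like_bot(spoofed_agent))
```

The header check only catches clients that announce themselves honestly, which is why it is just the first layer.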

250000000 / (3600 * 40) / 20 = 86

86 pages per machine is not very performant at all, just very simple parallelism will do.

The units of your equation are pages per second per machine, but I agree. Reading briefly it seems like he only used 141 threads per machine to do the actual crawling. This likely could be pushed an order of magnitude further (or even more!), especially by using green threads. It does seem like he was running up against CPU constraints and network issues soon after that, but also this was written in 2012.
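Spelling the rate out in Python, using the numbers from this subthread (250 million pages, 40 hours, 20 machines, 141 threads per machine):

```python
def crawl_rates(pages=250_000_000, hours=40, machines=20, threads=141):
    """Return pages per second overall, per machine, and per thread."""
    seconds = hours * 3600
    overall = pages / seconds
    per_machine = overall / machines
    per_thread = per_machine / threads
    return overall, per_machine, per_thread


if __name__ == "__main__":
    overall, per_machine, per_thread = crawl_rates()
    print(f"{overall:.0f} pages/s overall")
    print(f"{per_machine:.1f} pages/s per machine")
    print(f"{per_thread:.2f} pages/s per thread")
```

Each thread finishes well under one page per second, which suggests the threads spend most of their time waiting rather than fetching.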

Ca. 2007 I was running an RSS feed fetcher that was running 400 processes (not threads) in parallel on dual CPU Xeons. The only reason we didn't push it higher was that 400 in parallel was more than enough for our use at the time. Of course, of those 400 some were always waiting on IO. I never measured how many feeds we did on average every second.

I've used Common Crawl (http://commoncrawl.org/), which is also referenced in the article. This is great work and I would love it if they could get a daily crawl going.

Did you use a cuckoo filter instead of a bloom filter to manage dynamic additions?
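For context on the question, here is a minimal Bloom filter sketch in Python for remembering which URLs have been seen. The sizes and the SHA-256 double hashing scheme are my own choices for illustration, not the article's implementation. A plain Bloom filter like this has a fixed size and cannot delete entries, which is the usual reason to look at a cuckoo filter instead.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive k bit positions from one digest via double hashing.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


if __name__ == "__main__":
    seen = BloomFilter()
    seen.add("https://news.ycombinator.com/item?id=17462089")
    print("https://news.ycombinator.com/item?id=17462089" in seen)
```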

This is an amazing read. But I think that, for people to actually implement it, the more useful version would be how to crawl a billion pages on cheaper boxes, maybe spanning a week or so.

This is a post from 2012.

250 million is fewer characters than quarter billion.

250 million isn't cool.
