Let's say on average each site has 10 pages (a dozen huge blogs vs. tens of thousands of one-pagers); that would put the number at 18 billion pages.
Following that logic, the total web would be 72 times larger than what was scraped in this test.
So for a mere $41,760 you too can bootstrap your own Google! ;-)
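To make the arithmetic explicit, here is the extrapolation as a few lines of Python (all inputs are the estimates from the comment above, not measured values):

```python
# Back-of-the-envelope extrapolation from the figures in this thread.
sites = 1_800_000_000          # websites, per internetlivestats estimate
pages_per_site = 10            # assumed average pages per site
crawled_pages = 250_000_000    # pages scraped in this test
crawl_cost_usd = 580           # cost of this test

total_pages = sites * pages_per_site               # 18 billion pages
scale_factor = total_pages / crawled_pages         # 72x the test crawl
full_crawl_cost = scale_factor * crawl_cost_usd    # $41,760

print(total_pages, scale_factor, full_crawl_cost)
```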
I'm chiming in here since my employer has a few web archives from the IA and some other organizations.
That 10-page average seems a bit off considering our data, which is admittedly spotty since it was crawled by a third party.
But to give some numbers: in one of our experiments we filtered web sites from the archive for known entities and got 307,426,990 unique URLs that contained at least two of those entities (625,830,566 non-unique), and among those there were only 5,331,272 unique hosts. That archive contains roughly 3 billion crawled files (not only HTML, but other MIME types as well) and covers mostly the German web over a few years.
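For anyone wondering how a unique-host count like that is derived: it falls out of parsing each URL and keying on the host part. A minimal Python sketch (the example URLs are made up for illustration):

```python
from urllib.parse import urlsplit

def unique_hosts(urls):
    """Collect the distinct hosts (lowercased netloc) across a list of URLs."""
    return {urlsplit(u).netloc.lower() for u in urls}

urls = [
    "https://www.postbank.de/privatkunden",
    "https://www.postbank.de/geschaeftskunden",
    "http://example.org/page",
]
print(len(unique_hosts(urls)))  # 3 URLs, but only 2 unique hosts
```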
There are a lot of hosts that have millions of pages. To name a few: Amazon, Wordpress, eBay, all kinds of forums, even banks. For instance, www.postbank.de has over a million pages, and those were not re-crawled nearly that often.
Crawling may be cheap, but you also want to store that data and make it queryable without waiting minutes for a response to a query. That makes it far more expensive.
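To illustrate what "making it queryable" means at its simplest: the core structure behind fast text search is an inverted index mapping tokens to document ids. A toy Python sketch (the documents and whitespace tokenizer are deliberately simplistic):

```python
from collections import defaultdict

def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def query(index, *terms):
    """Return ids of documents containing ALL terms (an AND query)."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

docs = {1: "cheap web crawling", 2: "web index storage", 3: "storage costs"}
idx = build_index(docs)
print(query(idx, "web"))       # docs 1 and 2
print(query(idx, "storage"))   # docs 2 and 3
```

The point of the comment stands: at billions of documents this structure no longer fits in one machine's memory, and sharding, compressing, and serving it is where the real cost lives.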
> a dozen huge blogs vs. tens of thousands of one-pagers
Try crawling ONE WordPress blog with fewer than 10 posts and you will be surprised just how many pages there are, thanks to pagination across different filters and sorting options, feeds, media pages, etc.
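That URL explosion is just combinatorics: every filter multiplies with every sort order and every page number. A hypothetical sketch with made-up WordPress-style query parameters:

```python
from itertools import product

# Invented taxonomy for a tiny blog: the real parameter names vary by site.
categories = ["news", "tips"]
sorts = ["date", "title"]
pages = range(1, 4)  # 3 pagination pages per listing

urls = [
    f"/blog/{cat}/?orderby={sort}&paged={page}"
    for cat, sort, page in product(categories, sorts, pages)
]
print(len(urls))  # 2 categories x 2 sort orders x 3 pages = 12 URLs
```

And that is before feeds, tag pages, media attachment pages, and date archives join the product.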
> So for a mere $41,760 you too can bootstrap your own Google! ;-)
I think the cost of fetching and parsing the data is much lower than the cost of building an index and an engine to execute queries against that massive index.
Yeah, the guy ripped himself off. You can crawl it yourself for next to nothing from home. I think everyone's written their own crawler at some point, it's literally web 101.
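For the record, that "web 101" crawler is essentially a breadth-first traversal with a visited set. A sketch over an in-memory link graph (no network calls; the site structure is invented to stand in for fetched pages):

```python
from collections import deque

# Toy link graph standing in for fetched-and-parsed pages.
SITE = {
    "/": ["/about", "/blog"],
    "/about": ["/"],
    "/blog": ["/blog/post-1", "/blog?page=2"],
    "/blog?page=2": ["/blog/post-1"],
    "/blog/post-1": ["/", "/blog"],
}

def crawl(start):
    """Breadth-first traversal with a visited set -- the core of any crawler."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        url = queue.popleft()
        order.append(url)  # in a real crawler: fetch, parse, extract links here
        for link in SITE.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

print(len(crawl("/")))  # all 5 pages discovered exactly once
```

A real one swaps the dict lookup for an HTTP fetch plus link extraction, and adds politeness delays, robots.txt handling, and URL normalization; but the loop above is the whole idea.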
250,000,000 pages come in at $580.
There are 1.8b websites according to http://www.internetlivestats.com/total-number-of-websites/