
For those curious like I was how much it would cost to scrape the entire internet with the method and numbers provided:

250,000,000 pages come in at $580

There are 1.8b websites according to http://www.internetlivestats.com/total-number-of-websites/

Let's say on average each site has 10 pages (a dozen huge blogs vs. tens of thousands of one-pagers); that would put the number at 18 billion pages.

Following that logic, the total web would be 72 times larger than what was scraped in this test.

So for a mere $41,760 you too can bootstrap your own Google! ;-)
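
A quick back-of-the-envelope check of that extrapolation (the $580 per 250M pages figure is from the article; everything else is the guesswork above):

    pages_scraped = 250_000_000   # pages covered in the article's test
    cost_scraped = 580            # USD, from the article
    total_sites = 1_800_000_000   # internetlivestats.com figure
    pages_per_site = 10           # rough guess from above

    total_pages = total_sites * pages_per_site    # 18 billion pages
    scale_factor = total_pages / pages_scraped    # 72x
    total_cost = scale_factor * cost_scraped      # $41,760

    print(f"{scale_factor:.0f}x the test crawl, roughly ${total_cost:,.0f}")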




I'm chiming in here since my employer has a few web archives from the IA and some other organizations.

That 10x average seems to be a bit off considering our data, which is of course spotty since it's crawled by a third party.

But to give some numbers: in one of our experiments we filtered websites from the archive for known entities and got 307,426,990 unique URLs containing at least two of those entities (625,830,566 non-unique), spread across only 5,331,272 unique hosts. That works out to roughly 58 of those URLs per host, well above the 10x guess. The archive contains roughly 3 billion crawled files (not only HTML, but other MIME types as well) and covers mostly the German web over a few years.

There are a lot of hosts that have millions of pages. To name a few: Amazon, Wordpress, Ebay, all kinds of forums, even banks. For instance, www.postbank.de has over a million pages, and those weren't even re-crawled that often.
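
For anyone wanting to get a similar per-host breakdown out of their own crawl data, a minimal sketch (the file name and one-URL-per-line format are assumptions):

    # Count pages per host from a plain list of crawled URLs.
    from collections import Counter
    from urllib.parse import urlparse

    hosts = Counter()
    with open("crawled_urls.txt") as f:   # hypothetical input: one URL per line
        for line in f:
            host = urlparse(line.strip()).netloc
            if host:
                hosts[host] += 1

    print(f"{len(hosts):,} unique hosts")
    for host, count in hosts.most_common(10):   # hosts with the most pages
        print(f"{count:>10,}  {host}")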


OT, but why does postbank.de have over a million pages? Is there much of anything worth crawling there?


We never checked the details there. You can use https://web.archive.org/details/www.postbank.de and https://web.archive.org/web/*/postbank.de/* if you want to go exploring.

I assume a lot of it will be broken links and automatically generated pages, as is often the case.


Your link also says:

> It must be noted that around 75% of websites today are not active, but parked domains or similar.

So actually more like 0.5B active websites (25% of 1.8B is about 0.45B). Feels quite tiny. It seems most activity online really is behind walled gardens like FB.


You can check abstract statistics here: http://www.businessinsider.com/sandvine-bandwidth-data-shows...

The majority of bandwidth goes to video, but in terms of unique users, social media and Google do make up the majority.


I get what you're saying about it feeling quite tiny; however, look at it this way: that's like one active website for every 15 people on the planet!


D'oh you're right, should've caught that :-)


Crawling may be cheap, but you also want to save that data and make it queryable without waiting minutes for the response to a query. That makes it way more expensive.
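
As a rough illustration of why storage alone changes the picture (the average page size and compression ratio here are pure assumptions, not measured numbers):

    total_pages = 18_000_000_000   # the 18B estimate from the top comment
    avg_page_kb = 75               # assumed average HTML size, uncompressed
    compression = 0.2              # assumed ratio after compressing the HTML

    raw_tb = total_pages * avg_page_kb / 1e9   # KB -> TB
    stored_tb = raw_tb * compression
    print(f"~{raw_tb:,.0f} TB raw, ~{stored_tb:,.0f} TB compressed")

Even compressed, that is hundreds of terabytes that have to live on hardware fast enough to answer queries, before you've built any index at all.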


> dozen of huge blogs vs tens of thousands of onepagers

Try crawling ONE WordPress blog with fewer than 10 posts and you will be surprised just how many pages there are: pagination across different filters and sorting options, feeds, media pages, etc.
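
A minimal same-host crawler to see that explosion for yourself (the start URL is a placeholder; assumes requests and beautifulsoup4 are installed, and please respect robots.txt on a real site):

    # Breadth-first crawl of a single host, counting distinct URLs found.
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin, urlparse

    start = "https://example-blog.example/"   # placeholder blog URL
    host = urlparse(start).netloc
    seen, queue = {start}, [start]

    while queue and len(seen) < 2000:          # safety cap
        url = queue.pop(0)
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)

    print(f"{len(seen)} distinct URLs on {host}")

Even a near-empty blog typically surfaces feeds, tag/category/date archives, and attachment pages, which is where the count balloons.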


> So for a mere $41.760 you too can bootstrap your own Google! ;-)

I think the cost of fetching and parsing the data is much less than the cost of building an index and an engine to execute queries against that massive index.
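
To make that concrete, here is the simplest possible version of the "index and query engine" part; this is a toy sketch, nothing like what a production search engine actually does:

    # Toy inverted index: term -> set of page IDs, with AND queries.
    from collections import defaultdict

    index = defaultdict(set)

    def add_page(page_id: int, text: str) -> None:
        for term in text.lower().split():
            index[term].add(page_id)

    def search(query: str) -> set:
        ids = [index[t] for t in query.lower().split()]
        return set.intersection(*ids) if ids else set()

    add_page(1, "cheap web scraping at scale")
    add_page(2, "building a search index is the hard part")
    print(search("search index"))   # {2}

Multiply that by billions of documents, then add ranking, freshness, and latency targets, and crawling really does become the cheap part.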


Haha I love the math and making it concrete :)


Yeah, the guy ripped himself off. You can crawl it yourself for next to nothing from home. I think everyone's written their own crawler at some point; it's literally web 101.



