
For those curious like I was how much it would cost to scrape the entire internet with the method and numbers provided:

250,000,000 pages come in at $580

There are 1.8b websites according to http://www.internetlivestats.com/total-number-of-websites/

Let's say on average each site has 10 pages (a dozen huge blogs vs. tens of thousands of one-pagers); that would put the number at 18 billion pages.

Following that logic, the total web would be 72 times larger than what was scraped in this test.

So for a mere $41,760 you too can bootstrap your own Google! ;-)
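
A quick back-of-the-envelope check of that extrapolation (the $580 per 250M pages figure is from the article; everything else is the guesswork above):

    pages_scraped = 250_000_000   # pages covered in the article's test
    cost_scraped = 580            # USD, from the article
    total_sites = 1_800_000_000   # internetlivestats.com figure
    pages_per_site = 10           # rough guess from above

    total_pages = total_sites * pages_per_site    # 18 billion pages
    scale_factor = total_pages / pages_scraped    # 72x
    total_cost = scale_factor * cost_scraped      # $41,760

    print(f"{scale_factor:.0f}x the test crawl, roughly ${total_cost:,.0f}")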




I'm chiming in here since my employer has a few web archives from the IA and some other organizations.

That 10x average seems to be a bit off considering our data, which is of course spotty since it's crawled by a third party.

But to give some numbers: in one of our experiments we filtered websites from the archive for known entities and got 307,426,990 unique URLs containing at least two of those entities (625,830,566 non-unique), spread across only 5,331,272 unique hosts. That works out to roughly 58 of those URLs per host, well above the 10x guess. The archive contains roughly 3 billion crawled files (not only HTML, but other MIME types as well) and covers mostly the German web over a few years.

There are a lot of hosts that have millions of pages. To name a few: Amazon, Wordpress, Ebay, all kinds of forums, even banks. For instance, www.postbank.de has over a million pages, and those weren't even re-crawled that often.
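
For anyone wanting to get a similar per-host breakdown out of their own crawl data, a minimal sketch (the file name and one-URL-per-line format are assumptions):

    # Count pages per host from a plain list of crawled URLs.
    from collections import Counter
    from urllib.parse import urlparse

    hosts = Counter()
    with open("crawled_urls.txt") as f:   # hypothetical input: one URL per line
        for line in f:
            host = urlparse(line.strip()).netloc
            if host:
                hosts[host] += 1

    print(f"{len(hosts):,} unique hosts")
    for host, count in hosts.most_common(10):   # hosts with the most pages
        print(f"{count:>10,}  {host}")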


OT, but why does postbank.de have over a million pages? Is there much of anything worth crawling there?


We never checked the details there. You can use https://web.archive.org/details/www.postbank.de and https://web.archive.org/web/*/postbank.de/* if you want to go exploring.

I assume a lot of it will be broken links and automatically generated pages, as is often the case.


Your link also says:

> It must be noted that around 75% of websites today are not active, but parked domains or similar.

So actually more like 0.5B active websites (25% of 1.8B is about 0.45B). Feels quite tiny. It seems most activity online really is behind walled gardens like FB.


You can check abstract statistics here: http://www.businessinsider.com/sandvine-bandwidth-data-shows...

The majority of bandwidth goes to video, but in terms of unique users, social media and Google do make up the majority.


I get what you're saying about it feeling quite tiny; however, look at it this way: that's like one active website for every 15 people on the planet!


D'oh you're right, should've caught that :-)


Crawling may be cheap, but you also want to save that data and make it queryable without waiting minutes for the response to a query. That makes it way more expensive.
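
As a rough illustration of why storage alone changes the picture (the average page size and compression ratio here are pure assumptions, not measured numbers):

    total_pages = 18_000_000_000   # the 18B estimate from the top comment
    avg_page_kb = 75               # assumed average HTML size, uncompressed
    compression = 0.2              # assumed ratio after compressing the HTML

    raw_tb = total_pages * avg_page_kb / 1e9   # KB -> TB
    stored_tb = raw_tb * compression
    print(f"~{raw_tb:,.0f} TB raw, ~{stored_tb:,.0f} TB compressed")

Even compressed, that is hundreds of terabytes that have to live on hardware fast enough to answer queries, before you've built any index at all.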


> dozen of huge blogs vs tens of thousands of onepagers

Try crawling ONE WordPress blog with fewer than 10 posts and you will be surprised just how many pages there are: pagination across different filters and sorting options, feeds, media pages, etc.
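
A minimal same-host crawler to see that explosion for yourself (the start URL is a placeholder; assumes requests and beautifulsoup4 are installed, and please respect robots.txt on a real site):

    # Breadth-first crawl of a single host, counting distinct URLs found.
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin, urlparse

    start = "https://example-blog.example/"   # placeholder blog URL
    host = urlparse(start).netloc
    seen, queue = {start}, [start]

    while queue and len(seen) < 2000:          # safety cap
        url = queue.pop(0)
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)

    print(f"{len(seen)} distinct URLs on {host}")

Even a near-empty blog typically surfaces feeds, tag/category/date archives, and attachment pages, which is where the count balloons.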


> So for a mere $41.760 you too can bootstrap your own Google! ;-)

I think the cost of fetching and parsing the data is much less than the cost of building an index and an engine to execute queries against that massive index.
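
To make that concrete, here is the simplest possible version of the "index and query engine" part; this is a toy sketch, nothing like what a production search engine actually does:

    # Toy inverted index: term -> set of page IDs, with AND queries.
    from collections import defaultdict

    index = defaultdict(set)

    def add_page(page_id: int, text: str) -> None:
        for term in text.lower().split():
            index[term].add(page_id)

    def search(query: str) -> set:
        ids = [index[t] for t in query.lower().split()]
        return set.intersection(*ids) if ids else set()

    add_page(1, "cheap web scraping at scale")
    add_page(2, "building a search index is the hard part")
    print(search("search index"))   # {2}

Multiply that by billions of documents, then add ranking, freshness, and latency targets, and crawling really does become the cheap part.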


Haha I love the math and making it concrete :)


Yeah, the guy ripped himself off. You can crawl it yourself for next to nothing from home. I think everyone's written their own crawler at some point; it's literally web 101.



