

Common Crawl - namin
http://commoncrawl.org/

======
breckinloggins
If the implementation is semantically rich and complete enough, this might
really help those who want to tackle the first of pg's "ambitious startup"
ideas.

If I have an idea for a search product, competing with Google isn't really the
first roadblock my brain puts up. It's more like "sure brain, sounds swell;
now, how do you propose to populate this engine of yours?"

------
mkl
Previous discussion: <http://news.ycombinator.com/item?id=3209690>

------
pooyak
This blog post:
<http://matpalm.com/blog/2012/01/01/common_crawl_collocations/> mentions that
Common Crawl's data was last updated in September 2010. Does anyone know if
that's still the case?

------
Arelius
I imagine that building the index is, computationally, a comparable problem
to crawling the websites themselves. Does anyone have any data on whether
this is actually a large win?

~~~
boyter
I too would like to know the answer to this. In my experience building
indexes and crawlers, the crawler is always easier to write.

The reason is that initially you just need a lot of pages to work with.
Anyone can write a simple

while(links) { get link }

crawler and just let it run for months on end without too many issues. Heck,
just some xargs and wget will get you by for a long time.
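
For illustration, a minimal sketch of that loop in Python, using only the
standard library; the seed URL, page cap, and one-second politeness delay are
illustrative choices, not anything from this thread:

    # Breadth-first "while(links) { get link }" crawler sketch.
    import time
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        """Collects href values from <a> tags."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                for name, value in attrs:
                    if name == 'href' and value:
                        self.links.append(value)

    def crawl(seed, limit=100):
        queue, seen = deque([seed]), {seed}
        while queue and len(seen) < limit:
            url = queue.popleft()
            try:
                html = urlopen(url, timeout=10).read().decode('utf-8', 'replace')
            except Exception:
                continue  # dead or slow links are routine; skip them
            parser = LinkParser()
            parser.feed(html)
            for href in parser.links:
                absolute = urljoin(url, href)
                if absolute.startswith('http') and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
            time.sleep(1)  # be polite to the servers you hit
        return seen

A real crawler also needs robots.txt handling, per-host rate limits, and
deduplication, but the core loop really is that small.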

By the time you have your search engine indexing and spitting out results,
you are going to need your own dedicated crawler anyway, to ensure you are
crawling the pages you have identified as being most interesting.
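
One way to read "crawl the most interesting pages first" is a frontier
ordered by a score your index feeds back. A sketch, where score_page() and
fetch() are hypothetical stand-ins for whatever signal and fetcher you
actually have:

    # Frontier ordered by interest score; score_page() and fetch()
    # are hypothetical stand-ins, not anything from this thread.
    import heapq

    def crawl_by_priority(seeds, score_page, fetch, limit=1000):
        # heapq is a min-heap, so push negated scores to pop best-first
        frontier = [(-score_page(url), url) for url in seeds]
        heapq.heapify(frontier)
        seen = {url for _, url in frontier}
        fetched = []
        while frontier and len(fetched) < limit:
            _, url = heapq.heappop(frontier)
            links = fetch(url)  # assume fetch returns outgoing links, or None
            if links is None:
                continue
            fetched.append(url)
            for link in links:
                if link not in seen:
                    seen.add(link)
                    heapq.heappush(frontier, (-score_page(link), link))
        return fetched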

I imagine this data set is not so useful for those building a search engine
as for those wanting to calculate statistics on snapshots of the web, such as
how many pages use jQuery and the like.
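
For example, a rough sketch of that kind of statistic, assuming a crawl
segment downloaded locally and the warcio package (neither of which is
mentioned in the thread; the filename is hypothetical):

    # Count responses in a crawl segment that mention jQuery.
    # 'segment.warc.gz' is a hypothetical local filename.
    from warcio.archiveiterator import ArchiveIterator

    total = 0
    jquery = 0
    with open('segment.warc.gz', 'rb') as stream:
        # arc2warc=True also accepts the older ARC format
        for record in ArchiveIterator(stream, arc2warc=True):
            if record.rec_type != 'response':
                continue
            total += 1
            body = record.content_stream().read()
            if b'jquery' in body.lower():
                jquery += 1

    print(f'{jquery} of {total} responses mention jQuery')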

