

Free 5 Billion Page Web Index Now Available from Common Crawl Foundation - pooyak
http://www.readwriteweb.com/archives/common_crawl_foundation_announces_5_billion_page_w.php

======
patio11
I was hoping for Yahoo, Amazon, or Microsoft to throw a lot of resources at
this about 5~8 years ago. Since then, Google kind of ran away with the game in
crawling. They were far ahead of everyone else back then, but one could
conceive of a rag-tag group of companies, institutions, and individuals
pooling their resources and getting a crawl about 10% as good. These days, on
the externally visible evidence, they're probably several orders of magnitude
better than _everybody else on the planet combined_.

Take crawl freshness. If I publish a new blog post, it gets crawled and added
to the Google index in _seconds_. Other crawling efforts take _weeks_ between
refreshes.

~~~
ahadrana
Hi, I work at Common Crawl. We spent our time in 2011 improving our
algorithms, and hopefully this effort will start to show real results (with
respect to crawl frequency and relevancy) in 2012. But you are right: it is
pretty unlikely that our crawl will be fully competitive with the likes of
Google etc., multi-billion dollar corporations who dedicate huge amounts of
engineering and hardware resources to staying competitive in this field.

~~~
funthree
It is not "Google etc., multi-billion dollar corporations"; it is just Google.

------
wisty
Is there a sample dataset?

I think all projects should have sample datasets. It simplifies a lot of
things, and in this case stops hundreds of geeks burning through bandwidth
before they realize they don't have a clue what they are going to do with the
data.

~~~
ahadrana
We hear you. Could you define some criteria as to the type and size of sample
data you would like to see? We are working on producing more targeted/limited
collections, such as, perhaps, a collection of the most recently published
blog posts.

~~~
showerst
Perhaps two sets, one that's just a few hundred kilobytes that contains a few
sample .arc files to test against the format, and then one larger 'training'
set that's small enough to test against offline (maybe like 100MB?) but large
enough to contain a good sample of the possible content.
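
For the format-testing set, roughly this kind of code is what people would
write against it. A minimal sketch below: the header layout is the documented
ARC v1 one and the filename is invented, so treat it as a guess to check
against the real sample files (which are gzipped, so decompress first):

    # Minimal sketch of iterating ARC v1 records in an uncompressed .arc file.
    # Assumes the standard v1 header "<url> <ip> <archive-date> <content-type> <length>";
    # verify against the actual sample data before relying on it.
    def iter_arc_records(path):
        with open(path, "rb") as f:
            while True:
                header = f.readline()
                if not header:
                    break                      # end of file
                header = header.strip()
                if not header:
                    continue                   # blank line between records
                url, ip, date, content_type, length = header.split(b" ")
                payload = f.read(int(length))  # raw document/response bytes
                yield url.decode("utf-8", "replace"), content_type.decode(), payload

    if __name__ == "__main__":
        for url, ctype, body in iter_arc_records("sample.arc"):   # hypothetical file
            print(url, ctype, len(body))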

~~~
dcnstrct
Concur with this comment -- it might also help the community provide feedback
on structure and ways to segment that data so that there are more directed
efforts to consume small parts of the crawl for processing

------
dotBen
Although I'm personally all for open distribution of crawl data like this and
all of my personal websites are CC-licensed, isn't there something to be said
about the copyright status of the pages in the crawl file?

The crawl file presumably contains the contents of websites, and so the owners
of those websites could assert that Common Crawl Foundation is distributing
their work without permission or license.

There are all sorts of republishing/splog 'opportunities' with this crawl data
that go beyond the original expected use.

Surprisingly, I couldn't see anything about this covered in the FAQs.

~~~
ahadrana
Hi, you can view our terms of use at [http://www.commoncrawl.org/about/terms-
of-use/full-terms-of-...](http://www.commoncrawl.org/about/terms-of-use/full-
terms-of-use/). We adhere to the robots.txt standard, try to do all our
crawling above board, and (strictly personal opinion here) we are definitely
not in the business of diminishing or subverting people's rights with regard
to the content they produce. There are many other options available to those
who are determined to crawl a site's content, whether the site owner wants
them to or not. Our goal is to democratize access to our crawl for the
betterment of the Web ecosystem as a whole, and we believe that storing the data on S3
and making it accessible to a wide audience is the right way to accomplish
this goal.
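
For anyone curious what honoring robots.txt looks like mechanically, here is a
tiny illustration with Python's standard library (not our actual crawler code,
just the idea, with a made-up site and user agent):

    # Illustration only (not the Common Crawl crawler): fetch a site's
    # robots.txt and ask whether a given user agent may fetch a given URL.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("http://example.com/robots.txt")   # hypothetical site
    rp.read()

    if rp.can_fetch("ExampleCrawler", "http://example.com/some/page.html"):
        print("allowed by robots.txt")
    else:
        print("disallowed by robots.txt")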

------
rgrieselhuber
> We do not use Nutch for the purposes of crawling, but instead utilize a
> custom crawl infrastructure to strictly limit the rate at which we crawl
> individual web hosts.

Were there any other reasons to not use Nutch (performance, etc.)?

I'd love to hear more about the stack you're using to perform the crawls. If
you don't mind sharing, it would be very interesting to read about the costs
involved in gathering this data (how many machines, how long did it take,
etc.).

Any plans to open source that as well? In addition to the general lack of
freely available open web crawl data, there are precious few open source
projects (if any) that produce high-quality crawlers able to deal with the
modern web.

------
mthoms
I'd love to see Gabriel weigh in on this. I wonder if Duck Duck Go will be
able to take advantage of this resource?

~~~
coderdude
Nova Spivack said that the crawls have been going for several years. There's a
good chance that many of the pages in the archive are unacceptably outdated
for indexing purposes.

~~~
ahadrana
Hi, I work for Common Crawl. We are about to start an improved recrawl and will
be doing this more frequently going forward. In the process we will also
consolidate our data on S3 to keep it relevant. But, as with any crawl of the
Internet, there is a lot of noise in there. We spent most of 2011 tweaking the
algorithms to improve the freshness and quality of the crawl, and hopefully
this work starts to show results in 2012.

------
rshm
This is good news for anyone with an eye on a vertical search engine. With
your own device, the total cost of the seed data (assuming about 40TB) comes
in below one thousand dollars.

------
pooyak
One interesting discussion from here:
<http://www.commoncrawl.org/common-crawl-enters-a-new-phase/> It says the cost
of running a Hadoop job to scan all 5 billion documents is on the order of $100.

Does anyone know how this compares to, say, Yahoo BOSS? Is it even comparable?

~~~
Aloisius
Does BOSS still exist? I was under the impression that it was defunct.

~~~
nethsix
Yes. With Google no longer providing a search result API (not even a paid
version, the last I checked), people are turning to BOSS/Bing/(anything else?)

~~~
csulok
The Custom Search API is the search result API. The CSE has a flag for
searching the entire internet.
[http://www.google.com/support/customsearch/bin/answer.py?hl=...](http://www.google.com/support/customsearch/bin/answer.py?hl=en&answer=1210656)
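
A rough sketch of what querying it looks like (the key and cx below are
placeholders for your own API key and engine id; the whole-web behavior is a
setting on the CSE itself, per the linked answer):

    # Sketch of calling the Google Custom Search JSON API with stdlib only.
    # API_KEY and CX are placeholders; the "search the entire web" option is
    # configured on the custom search engine, not passed per request here.
    import json
    import urllib.parse
    import urllib.request

    API_KEY = "YOUR_API_KEY"    # hypothetical
    CX = "YOUR_ENGINE_ID"       # hypothetical

    params = urllib.parse.urlencode({"key": API_KEY, "cx": CX, "q": "common crawl"})
    with urllib.request.urlopen("https://www.googleapis.com/customsearch/v1?" + params) as resp:
        results = json.load(resp)

    for item in results.get("items", []):
        print(item["title"], item["link"])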

------
LisaG
New Common Crawl blog post addressing many of the questions raised here last
week. [http://www.commoncrawl.org/answers-to-recent-community-
quest...](http://www.commoncrawl.org/answers-to-recent-community-questions/)

I work at Common Crawl. Thanks for all the interest and the good questions!
Lisa

------
corbet
So what is the license for all of this data? It seems murky at best...

------
rb2k_
Oh nice. I've been doing a lot of crawling myself ([http://blog.marc-
seeger.de/2010/12/09/my-thesis-building-blo...](http://blog.marc-
seeger.de/2010/12/09/my-thesis-building-blocks-of-a-scalable-webcrawler/)) and
I'd love to get my hands on this data. I hope they'll segment their data a bit
further.

I personally would LOVE to have a simple list of the domain names themselves
without all of the connections and documents.
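
As a stopgap, something like this quick sketch would boil a list of crawled
URLs down to unique hostnames (the input filename is made up, and it doesn't
collapse subdomains to registered domains):

    # Quick sketch: reduce a one-URL-per-line file to a sorted set of hostnames.
    # "urls.txt" is a hypothetical input; subdomains are kept as-is.
    from urllib.parse import urlparse

    hosts = set()
    with open("urls.txt") as f:
        for line in f:
            host = urlparse(line.strip()).hostname
            if host:
                hosts.add(host.lower())

    for host in sorted(hosts):
        print(host)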

Also: Why not just use bittorrent to distribute it?

~~~
Aloisius
I imagine they don't use bittorrent because the dataset is both very large
(TBs) and changes frequently.

With S3, you could boot up a bunch of Hadoop processes, pull the data (without
incurring any bandwidth costs, I believe), process it, and dump out whatever
you want.
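
A rough sketch of the "pull it from S3" part, assuming you run it from EC2
(the bucket and prefix are placeholders, not the real Common Crawl locations,
so check their docs):

    # Sketch: copy one crawl file out of S3 for local inspection, using the
    # boto (v2) S3 API. Bucket and prefix below are placeholders; run from
    # EC2 in the same region to avoid bandwidth charges.
    import boto

    conn = boto.connect_s3()
    bucket = conn.get_bucket("commoncrawl-example-bucket")        # hypothetical

    for key in bucket.list(prefix="crawl-data/segment-0000/"):    # hypothetical
        if key.name.endswith(".arc.gz"):
            key.get_contents_to_filename(key.name.rsplit("/", 1)[-1])
            break                                                  # just one file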

------
pablohoffman
I initially submitted this post, but then deleted it and resubmitted a link to
the original post on the Common Crawl blog:
<http://news.ycombinator.com/item?id=3208853>

I now regret that, since this one got much more attention. I was under the
impression that linking to the original post was more welcome here on HN, but
it seems this is not always the case.

------
ChuckMcM
I wonder if crooks will try to exploit this crawl. As a person who has an
index of the web like this, it has been interesting to see what they look for.
SSNs and credit card numbers are common, as are sites running older versions
of PHP software or exploitable shopping carts.

~~~
yaix
It makes it very easy for people to steal vast amounts of your content and
republish it on their own sites, with ads all around it.

Many content sites have protections in place to recognize bots by their
behavior or use "honeypots" to tell bots apart from human visitors and thus
avoid large-scale content theft.
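
To make the honeypot idea concrete, a minimal sketch (route names and the
in-memory flag list are made up; a real site would do something more robust):

    # Minimal honeypot sketch: a URL that is hidden from human visitors (and
    # disallowed in robots.txt), so anything requesting it is almost certainly
    # a scraper. Flask is used purely for illustration; names are invented.
    from flask import Flask, request

    app = Flask(__name__)
    flagged_ips = set()   # in practice, persist and expire these

    @app.route("/trap-link")          # hidden from humans via CSS / robots.txt
    def trap():
        flagged_ips.add(request.remote_addr)
        return "", 204

    @app.route("/article")
    def article():
        if request.remote_addr in flagged_ips:
            return "Too many requests", 429   # or serve a CAPTCHA
        return "<a href='/trap-link' style='display:none'></a>Real content here"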

~~~
noahc
Presumably those protections would prevent this bot from collecting data as
well?

------
hnwh
I don't see any links to download their Hadoop classes...

~~~
ahadrana
Sorry, our github repository had some accidental check-ins that we needed to
remove. I will share the link to the code shortly.

------
mhp
"Well this has to be a first for a software company"

I'll just leave this here: <http://training.fogcreek.com>

~~~
pragmatic
You sure you commented on the correct article?

