Free 5 Billion Page Web Index Now Available from Common Crawl Foundation (readwriteweb.com)
201 points by pooyak on Nov 8, 2011 | 39 comments

I was hoping for Yahoo, Amazon, or Microsoft to throw a lot of resources at this about 5~8 years ago. Since then, Google kind of ran away with the game in crawling. They were far ahead of everyone else back then, but one could conceive of a rag-tag group of companies, institutions, and individuals pooling their resources and getting a crawl about 10% as good. These days, on the externally visible evidence they're probably several orders of magnitude better than everybody else on the planet combined.

Take crawl freshness. If I publish a new blog post, it gets crawled and added to the Google index in seconds. Other crawling efforts take weeks between refreshes.

Hi, I work at commoncrawl. We have spent our time (in 2011) improving our algorithms, and hopefully this effort will start to show real results (with respect to crawl frequency and relevancy) in 2012. But you are right, it is pretty unlikely that our crawl will be able to be fully competitive with the likes of Google etc., multi-billion dollar corporations who dedicate huge amounts of engineering and hardware resources to stay competitive in this field.

It is not "Google etc., multi-billion dollar corporations"; it is just Google.

> I was hoping for Yahoo, Amazon, or Microsoft to throw a lot of resources at this about 5~8 years ago. Since then, Google kind of ran away with the game in crawling.

In the 2004 timeframe, Yahoo was crawling about the same number of pages as Google. (More some months.)

> If I publish a new blog post, it gets crawled and added to the Google index in seconds. Other crawling efforts take weeks between refreshes.

Time from crawl to appearing in search results is a different issue.

Is there a sample dataset?

I think all projects should have sample datasets. It simplifies a lot of things, and in this case stops hundreds of geeks burning through bandwidth before they realize they don't have a clue what they are going to do with the data.

We hear you. Could you define some criteria as to the type and size of sample data you would like to see? We are working on producing more targeted/limited collections, such as the most recently published blog posts.

Perhaps two sets, one that's just a few hundred kilobytes that contains a few sample .arc files to test against the format, and then one larger 'training' set that's small enough to test against offline (maybe like 100MB?) but large enough to contain a good sample of the possible content.

Concur with this comment -- it might also help the community provide feedback on structure and ways to segment that data, so that there are more directed efforts to consume small parts of the crawl for processing.

Although I'm personally all for open distribution of crawl data like this and all of my personal websites are CC-licensed, isn't there something to be said for the copyright status of the pages in the crawl file?

The crawl file presumably contains the contents of websites and so the owners of those websites could assert that Common Crawl Foundation is distributing their work without permission or license.

There are all sorts of republishing/splog 'opportunities' with this crawl data that go beyond the original expected use.

Surprisingly, I couldn't see anything about this covered in the FAQs.

Hi, you can view our terms of use at http://www.commoncrawl.org/about/terms-of-use/full-terms-of-.... We adhere to the robots.txt standard, try to do all our crawling above board, and (strictly personal opinion here) we are definitely not in the business of diminishing or subverting people's rights with regard to the content they produce. There are many other options available to those who are determined to crawl a site's content, whether the site owner wants them to or not. Our goal is to democratize access to our crawl for the betterment of the Web ecosystem as a whole, and we believe storing the data on S3 and making it accessible to a wide audience is the right way to accomplish this goal.
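For readers wondering what "adhering to robots.txt" looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not Common Crawl's crawler code; the user agent name `ExampleBot`, the sample rules, and the `is_allowed` helper are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/blog/post.html",
                "http://example.com/private/page.html"):
        print(url, is_allowed(SAMPLE_ROBOTS_TXT, "ExampleBot", url))
```

A polite crawler would run a check like this before every fetch and skip any URL the site owner has disallowed.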

I see it in the ToS:


-Violate other people’s rights (IP, proprietary, etc.)

> We do not use Nutch for the purposes of crawling, but instead utilize a custom crawl infrastructure to strictly limit the rate at which we crawl individual web hosts.

Were there any other reasons to not use Nutch (performance, etc.)?

I'd love to hear more about the stack you're using to perform the crawls. If you don't mind sharing, it would be very interesting to read about the costs involved in gathering this data (how many machines, how long did it take, etc.)

Any plans to open source that as well? In addition to a general lack of open web crawl data freely available, there are precious few open source projects (if any) that produce high quality crawlers able to deal with the modern web.

I'd love to see Gabriel weigh in on this. I wonder if Duck Duck Go will be able to take advantage of this resource?

I love stuff like this effort. The more open data sources, the better for everyone. I'm sure we (DuckDuckGo) will find a way to make use of it :)

This was my first thought, too. I can't seem to hit the resource to check it out but even if the content is "stale" it can be used for a couple different reasons. Initial snapshot of pages, decent starting index on self crawling (instead of reliance on BOSS or Bing), content differentiation, nullifying search bias (if existent)... But I'm not really a search guy so I could be jaded on its importance.

Nova Spivack said that the crawls have been going for several years. There's a good chance that many of the pages in the archive are unacceptably outdated for indexing purposes.

Hi. I work for commoncrawl. We are about to start an improved recrawl and will be doing this more frequently going forward. In the process we will also consolidate our data on S3 to keep it relevant. But, as with any crawl of the Internet, there is a lot of noise in there. We spent most of 2011 tweaking the algorithms to improve the freshness and quality of the crawl, and hopefully this work starts to show results in 2012.

I'm not sure whether major portions of their archive are unacceptably outdated.

But I am sure that it would be a logic failure to conclude that it must be out of date simply because they've been indexing for several years. With that logic, Google would be further out of date, having indexed for over a decade.

It is good news for anyone with an eye on building a vertical search engine. With your device, the total cost of the seed data (assuming about 40TB) comes to below one thousand dollars.

One interesting discussion from here: http://www.commoncrawl.org/common-crawl-enters-a-new-phase/ It says the cost of running a Hadoop job to scan all 5 billion documents is on the order of $100.

Does anyone know how this compares to, let's say, Yahoo BOSS? Is it even comparable?

Hi, I work at commoncrawl, so I will try to answer your question. We store our crawl data on S3 in the form of 100MB compressed archives and there are between 40,000 and 50,000 such files in commoncrawl’s bucket today. The key to scanning such a large set of files efficiently on EC2 is to have each of your Mappers (assuming you are running Hadoop) open multiple S3 streams in parallel to maintain some desired level of throughput. For example, assuming that you can maintain on average a 1MByte/sec throughput per S3 stream, and you start 10 parallel streams per Mapper, you should be able to sustain a throughput of 80 Mbits/sec or 10 MBytes/sec per Mapper. If you were to run one Mapper per EC2 small instance, and start 100 such instances, this would yield an aggregate throughput of close to 3TB/hour. At that rate, you would need about 16 hours to scan 50TB of data, or a total of 1600 machine hours at $.085 per hour, costing you somewhere in the neighborhood of $130.00. Of course, you would then need to add in the cost of running any subsequent aggregation / data consolidation jobs and the cost of storing your final data on S3. So, the $100.00 number is generally in the ballpark but final numbers may vary :-)
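The back-of-the-envelope estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function name `estimate_scan` and its parameters are made up here, the defaults are the figures quoted in the comment, and decimal units (1 TB = 1,000,000 MB) are assumed.

```python
def estimate_scan(total_tb, mb_per_stream=1.0, streams_per_mapper=10,
                  instances=100, usd_per_instance_hour=0.085):
    """Estimate throughput, wall-clock time and EC2 cost of one full scan.

    Assumes one Mapper per instance and ideal, sustained throughput.
    """
    # Aggregate throughput across all instances, in MB/s.
    mb_per_sec = mb_per_stream * streams_per_mapper * instances
    # Convert to TB/hour using decimal units (1 TB = 1,000,000 MB).
    tb_per_hour = mb_per_sec * 3600 / 1_000_000
    hours = total_tb / tb_per_hour
    machine_hours = hours * instances
    cost_usd = machine_hours * usd_per_instance_hour
    return tb_per_hour, hours, machine_hours, cost_usd


if __name__ == "__main__":
    tb_per_hour, hours, machine_hours, cost_usd = estimate_scan(50)
    print(f"{tb_per_hour:.1f} TB/hour, {hours:.1f} hours, "
          f"{machine_hours:.0f} machine hours, ${cost_usd:.2f}")
```

With the ideal figures this works out to 3.6 TB/hour, roughly 14 hours and about $118 for 50TB. The comment's more conservative "close to 3TB/hour" stretches that to 16 to 17 hours, or about $136 at 1600 machine hours, so both land in the same ballpark as the $100 figure.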

As far as comparisons to Yahoo BOSS are concerned, no, we are definitely not comparable to Yahoo BOSS or other such APIs that run on top of an already built (and properly ranked) inverted index of the web. At this stage we only produce bulk snapshots of what we crawl, and we are focusing our engineering resources on improving the frequency and coverage of the crawl (the results of which will hopefully start to bear fruit in early 2012). Perhaps at some point in the near future, we can partner with the community to build a rudimentary full-text inverted index of the Web that we can make available in bulk via S3 as well.

Hey ahadrana, I haven't found anything about the page ranks on the website. Are they included? Do you know if it is possible to go through only the metadata of the crawl, say to get the page ranks for a list of pages, or do you have to go through the full crawl?

The pagerank and other metadata we compute are not part of the S3 corpus, but we do collect this information and will probably make it available in a separate S3 bucket in Hadoop SequenceFile format. Be aware that our pagerank will probably not have a high degree of correlation to Google's pagerank number, since their pagerank calculation is going to be a lot more sophisticated than our version.

Does BOSS still exist? I was under the impression that it was defunct.

I was the former GM of Yahoo BOSS (was there from pre-launch through 11/09). BOSS does still exist - http://developer.yahoo.com/search/boss/. It is now a paid API under the umbrella of Yahoo Developer Network. The pricing plan (http://developer.yahoo.com/search/boss/#pricing) is based on query type and volume. Unfortunately there is no self-serve advertising model (meaning if you incorporate Y!/Bing search ads, the service is free). It's important to note though that this is the Bing search index, not the old Yahoo Search index that is effectively shut down. The original BOSS product was based on Yahoo! Search.

From what I have heard BOSS continues to do very well and is pointed at internally as how to turn an API into a real business and product.

One more note, I am now at Factual where we are very happy consumers of the CommonCrawl service.

Yes. With Google no longer providing a search result API (not even a paid version, the last I checked), people are turning to BOSS/Bing/(anything else?)

The Custom Search API is the search result API. The CSE has a flag for searching the entire internet. http://www.google.com/support/customsearch/bin/answer.py?hl=...

New Common Crawl blog post addressing many of the questions raised here last week. http://www.commoncrawl.org/answers-to-recent-community-quest...

I work at Common Crawl. Thanks for all the interest and the good questions! Lisa

So what is the license for all of this data? It seems murky at best...

Oh nice. I've been doing a lot of crawling myself (http://blog.marc-seeger.de/2010/12/09/my-thesis-building-blo...) and I'd love to get my hands on this data. I hope they'll segment their data a bit further.

I personally would LOVE to have a simple list of the domain names themselves without all of the connections and documents.

Also: Why not just use bittorrent to distribute it?

I imagine they don't use bittorrent because it is both very large (TBs) and changes frequently.

With S3, you could boot up a bunch of Hadoop processes, pull it (without incurring any bandwidth costs I believe), process it and dump out whatever you want.

I initially submitted this post, but then deleted it and resubmitted to the original post on Common Crawl blog: http://news.ycombinator.com/item?id=3208853

I now regret it, since this one got much more attention. I was under the impression that linking to the original post was more welcome here on HN, but it seems this is not always the case.

I wonder if crooks will try to exploit this crawl. As a person who has an index of the web like this, it has been interesting to see what they look for. SSNs and credit card numbers are common, as are sites running older versions of PHP software or exploitable shopping carts.

It makes it very easy for people to steal vast amounts of your content and republish it on their own sites, with ads all around it.

Many content sites have protections in place to recognize bots by their behavior or use "honeypots" to tell bots apart from human visitors and thus avoid large scale content theft.

Presumably those protections would prevent this bot from collecting data as well?

I don't see any links to download their Hadoop classes.

Sorry, our github repository had some accidental check-ins that we needed to remove. I will share the link to the code shortly.

"Well this has to be a first for a software company"

I'll just leave this here: http://training.fogcreek.com

You sure you commented on the correct article?
