
Carnegie Mellon Public IR/Web Mining Datasets - Anon84
http://boston.lti.cs.cmu.edu/callan/Data/#Web
======
Anon84
I wonder HOW/IF they'll ever release this particular dataset:

    
    
    web08-bst.v1

      * Description: A 25 terabyte dataset of about 1 billion web pages crawled in November, 2008.
        The crawl order was best-first search, using the OPIC metric. The crawl was started from about
        25 million URLs that either i) had high OPIC values in a web graph produced from an earlier 200
        million page crawl, or ii) were ranked highly by a commercial search engine for one of 16,000
        sample queries in one of 10 languages. This dataset covers web content in English, Chinese, Spanish,
        Japanese, French, German, Arabic, Portuguese, Korean, and Italian.
      * Creators: J. Callan, M. Hoy, C. Yoo, and L. Zhao.
      * Status: In progress. Expected to be available to other researchers by March, 2009.
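
For anyone unfamiliar with OPIC, here is a rough sketch of what "best-first search, using the OPIC metric" could look like on a toy link graph. This is only my reading of the description, not the CMU crawler: the function name `opic_crawl_order`, the `toy_graph` dict, and the alphabetical tie-breaking are all made up for illustration, and a real OPIC implementation also tracks per-page history and revisits pages, which this skips.

```python
from collections import defaultdict


def opic_crawl_order(graph, seeds, max_pages):
    """Return the order in which a best-first, OPIC-style crawler fetches pages.

    graph     -- dict: page -> list of outlinks (hypothetical stand-in for fetching)
    seeds     -- starting pages; the initial "cash" is split evenly among them
    max_pages -- stop after this many pages have been fetched
    """
    cash = defaultdict(float)
    for seed in seeds:
        cash[seed] += 1.0 / len(seeds)

    crawled = []
    seen = set()
    while len(crawled) < max_pages:
        # Frontier: pages that hold cash (i.e. are known) but are not yet fetched.
        frontier = [page for page in cash if page not in seen]
        if not frontier:
            break
        # Best-first: take the page with the most cash; break ties by name
        # so the result is deterministic (an assumption of this sketch).
        page = min(frontier, key=lambda p: (-cash[p], p))
        seen.add(page)
        crawled.append(page)

        # "Fetching" the page: split its cash evenly among its outlinks.
        outlinks = graph.get(page, [])
        if outlinks:
            share = cash[page] / len(outlinks)
            for target in outlinks:
                cash[target] += share
        cash[page] = 0.0
    return crawled


# Toy graph: "e" is linked from both "b" and "c", so it collects more cash than "d".
toy_graph = {
    "a": ["b", "c", "d"],
    "b": ["e"],
    "c": ["e"],
    "d": ["f"],
}

if __name__ == "__main__":
    print(opic_crawl_order(toy_graph, ["a"], 6))
```

In this toy graph a plain breadth-first crawl would fetch "d" before "e", while the cash-based ordering fetches "e" first because two already-crawled pages point to it.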
    

I know of several research groups and start-ups that wouldn't mind playing
with it.

------
gtani
<http://www.kdnuggets.com/datasets/index.html>

