This is going to be invaluable for information retrieval researchers.

Google, MSR, and Yahoo! have an edge over universities in research because of the large amount of data they collect from their users. All other institutions are left with either small benchmark datasets or synthetic data, neither of which is usually representative of actual usage scenarios. I myself had to synthesize a query log from the Wikipedia request logs to test some of my data structures on large-scale data.

I expect to see a huge number of papers using this data in their experiments in the near future. Thanks, Blekko!

As far as I can tell, this contribution from Blekko doesn't have any user data/queries in it.

As far as I can tell, this is the best resource for publicly available search engine query logs.


I don't intend to downplay the contribution; a large collection of web docs classified as spam/porn is still a very nice thing for researchers to have.

This is our first donation. We have a lot more we plan on giving, but for user queries, for example, the privacy issues are a lot more difficult to work through. We have no interest in being the next privacy scandal.

