Google, MSR, and Yahoo! have an edge over universities in research because of the large amount of data they collect from their users; all other institutions are left with either small benchmark datasets or synthetic data, which are usually not representative of actual usage scenarios. I myself had to synthesize a query log from the Wikipedia request logs to test some of my data structures on large-scale data.
I expect to see a huge number of papers using these data in their experiments in the near future. Thanks, Blekko!
As far as I can tell, this is the best resource for publicly available search engine query logs.
I don't intend to downplay the contribution; a large collection of spam/porn-classified web docs is still a very nice thing for researchers to have.