The point worth noting is that this is not an archive of downloaded pages; it's data for the blekko equivalent of PageRank (i.e. computed relationships, not just the pages). To generate this independently would require not only access to a large crawl, but also robust code and most probably a large cluster to compute it in reasonable time, not to mention legal advice to avoid stepping on Google (et al.) patents.
If only certain other companies weren't so precious about their publicly derived data. Fabulous donation.
> not to mention legal advice to avoid stepping on Google (et al) patents.
You may be right, you may not be, but if this is the equivalent of PageRank-derived data, then you may not be in the clear to use it as-is. After all, if 'PageRank' went into producing it, then buying and using it makes you the beneficiary of patent infringement.
Personally I'd say so-sue-me, but it should still be noted that the fact that someone else did the infringing does not automatically put you in the clear when using the end product.
blekko doesn't compute PageRank, and we don't compute anything similar to it, either. It's highly gamed and less useful than you might think. (The academic equivalent of PageRank for research papers is highly gamed, too, by citation clubs...)
By the way, the original PageRank patent is owned and licensed by Stanford University, not by Google.
Can you tell us a little more about what the 'ranking metadata' is? There's not much to go on from the announcement. It's also not clear whether the data is available only for Common Crawl's operational purposes, or whether it's intended to become an integral part of the public dataset.
The ranking metadata consists of: domain ranks, url ranks, and booleans for whether blekko considers the domain or url to be webspam or porn. This list will expand in the future.
The data is currently available for Common Crawl's operational purposes, and is eventually going to be part of Common Crawl's public dataset. We're currently ironing out a useful format for making it efficiently accessible, compatible with some other metadata which Common Crawl is planning on making available.
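For concreteness, the fields listed above (domain rank, url rank, webspam/porn booleans) could be laid out as one JSON record per url. This is purely a hypothetical sketch — the actual format is, as noted, still being ironed out, and the field names and score scale here are my own invention:

```python
import json

# Hypothetical per-url ranking metadata record; all names and the
# 0-1 score scale are assumptions, not blekko's actual schema.
record = {
    "url": "http://example.com/page",
    "domain": "example.com",
    "domain_rank": 0.73,   # hypothetical domain-level score
    "url_rank": 0.41,      # hypothetical url-level score
    "is_webspam": False,   # blekko's webspam judgment for the url
    "is_porn": False,      # blekko's porn judgment for the url
}

# A JSON-lines layout (one record per line) would make the data easy
# to stream and filter with standard tools.
line = json.dumps(record, sort_keys=True)
parsed = json.loads(line)
print(parsed["domain_rank"])
```

A flat record like this would also extend naturally as the field list grows, which matches the statement that the list will expand in the future.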