If only certain other companies weren't so precious about their publicly derived data. Fabulous donation.
Agreed. I look forward to the day when the Open Data movement is as established as the Open Source movement.
You may be right, you may not, but if this is the equivalent of PageRank-derived data, then you may not be in the clear to use it as-is. After all, if 'PageRank' went into producing it, then buying and using it makes you the beneficiary of patent infringement.
Personally I'd say so-su-mi, but it's still worth noting that the fact that someone else did the infringing does not automatically put you in the clear when you use the end product.
See: indirect infringement.
If that weren't legal, then SEOmoz would have been sued a long, long time ago (see Open Site Explorer, Page Authority, Domain Authority, etc.).
By the way, the original PageRank patent is owned and licensed by Stanford University, not by Google.
The data is currently available for Common Crawl's operational purposes, and is eventually going to be part of Common Crawl's public dataset. We're currently ironing out a useful format for making it efficiently accessible, compatible with some other metadata which Common Crawl is planning on making available.
> Common Crawl will use blekko’s metadata to improve its crawl quality, while avoiding webspam, porn, and the influence of excessive SEO (search engine optimization)
Why avoid porn? Millions of people deliberately search for porn on the internet every day. It's hardly less worthy of crawling than any other content. The next sentence goes on to suggest that porn is not "useful to humans", which is obviously false.
If Common Crawl is indeed filtering out content they determine to be pornographic, I hope they are taking care not to also remove information on sexual and relationship health and LGBT rights, which are often collateral damage of porn-blocking systems. And it would be nice to see an open acknowledgement that filtering is going on - I couldn't find any references to this at commoncrawl.org.
We do not have anything against porn. However, when people are not searching for porn, showing them porn results makes for a bad search experience. So identifying porn, and showing it only for porn-relevant queries, is vitally important to search quality.
Your answer is a bit at odds with http://news.ycombinator.com/item?id=4933437
One of the funny things about language is that there is always a pun or an innuendo which can trigger a hit on a porn site. However, if most of what you're looking for isn't porn, then the search engine has to assume you are not looking for porn and avoid surfacing NSFW links in your results. You could always explicitly ask for it with /porn, but then that is a clear signal of what you are looking for.
Part of the crawl data includes an indication of whether or not the ranker thought the document was 'porn' or 'not porn'. If you're selecting things to return, you can ignore that bit and mix porn with non-porn: when someone searches for 'beavers', they get a wider variety of results than they would if you assumed they meant the furry critters which chew on trees, or sports teams, and limited results to those documents.
Having the content there but tagged gets you halfway to being able to filter it out. Not having the tag means that when you merge the data with another set, you're not going to be able to remove the porn.
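To make the point concrete, here is a minimal sketch of what filtering on such a tag might look like. The field name "is_porn" and the record layout are assumptions for illustration, not the actual Common Crawl metadata schema.

```python
# Hypothetical sketch: "is_porn" is an assumed field name, not the
# real Common Crawl metadata schema.
def filter_results(docs, allow_porn=False):
    """Drop documents the ranker tagged as porn, unless explicitly requested.

    Documents missing the tag are kept, since untagged data can't be filtered
    (which is exactly the merge problem described above)."""
    return [d for d in docs if allow_porn or not d.get("is_porn", False)]

docs = [
    {"url": "http://example.com/beaver-dams", "is_porn": False},
    {"url": "http://example.com/nsfw", "is_porn": True},
    {"url": "http://example.com/untagged"},
]
print([d["url"] for d in filter_results(docs)])
```

The key design point is that the tag travels with the document, so each consumer decides whether to drop, keep, or specifically select the flagged content.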
And it also allows you to use it as a training set for classifiers.
One could imagine a project on Common Crawl which auto-generated a list of slang terms for porny things by creating a list of n-grams from the words used in documents tagged as porn.
I really appreciate your mention of LGBT and sexual health sites being collateral damage - we need to draw more attention to that problem. I would love to see someone work with Common Crawl to improve methods of distinguishing.
Google, MSR, and Yahoo! have an edge over universities in research because of the large amount of data they collect from their users; all other institutions are left with either small benchmark datasets or synthetic data, which are usually not representative of actual usage scenarios. I myself had to synthesize a query log from the Wikipedia request logs to test some of my data structures at scale.
I expect to see a huge number of papers which will use these data in their experiments in the immediate future. Thanks, Blekko!
As far as I can tell, this is the best resource for publicly available search engine query logs.
I don't intend to downplay the contribution; having a large collection of spam/porn-classified web docs is still a very nice thing for researchers to have.
Some code libraries for using Common Crawl data:
Some clues for getting started:
I'm curious: how can blekko stay in business? Do they get enough traffic and revenue from ads, etc., to maintain some sort of positive cash flow, or are they simply burning through cash from investors?
blekko Bill of Rights
1. Search shall be open
2. Search results shall involve people
3. Ranking data shall not be kept secret
4. Web data shall be readily available
5. There is no one-size-fits-all for search
6. Advanced search shall be accessible
7. Search engine tools shall be open to all
8. Search & community go hand-in-hand
9. Spam does not belong in search results
10. Privacy of searchers shall not be violated
On a related note www.procog.com has a totally open algorithm.
Google needs to open its index and create a search marketplace. There are millions of domain- and location-specific apps that could be built around that index.