
Blekko donates search data to Common Crawl - dwynings
http://blog.blekko.com/2012/12/17/common-crawl-donation/
======
hosay123
The point worth noting is that it's not an archive of downloaded pages, it's
data for the Blekko equivalent of PageRank (i.e. computed relationships, not
just the pages). To generate this independently would not only require access
to a large crawl, but also robust code and most probably a large cluster to
compute it in reasonable time, not to mention legal advice to avoid stepping
on Google (et al) patents.
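For a sense of what "robust code and a large cluster" is doing, the core of a PageRank-style computation is just a power iteration over the link graph; the difficulty is running it over billions of edges. A toy sketch (illustrative only, not blekko's or Google's actual algorithm):

```python
# Minimal PageRank-style power iteration over a toy link graph.
# Real crawls have billions of edges; this only illustrates the math.

def pagerank(links, damping=0.85, iters=50):
    """links: dict mapping page -> list of pages it links to."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - damping) / n for p in pages}
        for page, outs in links.items():
            if outs:
                share = damping * rank[page] / len(outs)
                for target in outs:
                    new[target] += share
            else:  # dangling page: spread its rank evenly
                for target in pages:
                    new[target] += damping * rank[page] / n
        rank = new
    return rank

ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

On a real crawl the rank vector and edge list don't fit in one machine's memory, which is why this turns into a cluster job.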

If only certain other companies weren't so precious about their publicly
derived data. Fabulous donation.

~~~
jacquesm
> not to mention legal advice to avoid stepping on Google (et al) patents.

You may be right, you may not be, but if this is the equivalent of PageRanked
data then you may not be in the clear to use it as is. After all, if
'PageRank' went into producing it, buying and using it makes you the
beneficiary of patent infringement.

Personally I'd say so-sue-me, but it should still be noted that the fact that
someone else did the infringing does not automatically put you in the clear
when using the end product.

[http://en.wikipedia.org/wiki/Patent_infringement_under_Unite...](http://en.wikipedia.org/wiki/Patent_infringement_under_United_States_law)

See the section on indirect infringement.

~~~
AznHisoka
No, this is not the equivalent of PageRank. It could be an approximation,
however, and they are probably calling it a different name.

If that weren't legal, then SEOMoz would've been sued a long, long time ago
(see OpenSiteExplorer, Page Authority, Domain Authority, etc.)

~~~
greglindahl
blekko doesn't compute PageRank, and we don't compute anything similar to it,
either. It's highly gamed and less useful than you might think. (The academic
equivalent of PageRank for research papers is highly gamed, too, by citation
clubs...)

By the way, the original PageRank patent is owned and licensed by Stanford
University, not by Google.

~~~
hosay123
Can you tell us a little more about what the 'ranking metadata' is? There's
not much to go on from the announcement. It's also not clear whether the data
is available only for Common Crawl's operational purposes, or whether it's
intended to become an integral part of the public data set.

~~~
greglindahl
The ranking metadata consists of: domain ranks, url ranks, and booleans for
whether blekko considers the domain or url to be webspam or porn. This list
will expand in the future.

The data is currently available for Common Crawl's operational purposes, and
is eventually going to be part of Common Crawl's public dataset. We're
currently ironing out a useful format for making it efficiently accessible,
compatible with some other metadata which Common Crawl is planning on making
available.
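Since the format was still being ironed out at the time, here is a purely hypothetical sketch of what consuming such per-domain records might look like. The field names (`domain_rank`, `is_webspam`, `is_porn`) and the JSON-lines layout are invented for illustration, not the actual Common Crawl format:

```python
import json

# Hypothetical per-domain ranking metadata, one JSON object per line.
# Field names are invented for illustration; the real format was still
# being decided when this thread was written.
sample = """\
{"domain": "example.com", "domain_rank": 1234, "is_webspam": false, "is_porn": false}
{"domain": "spam.example", "domain_rank": 987654, "is_webspam": true, "is_porn": false}
"""

def load_clean_domains(lines):
    """Yield domains not flagged as webspam or porn."""
    for line in lines:
        rec = json.loads(line)
        if not (rec["is_webspam"] or rec["is_porn"]):
            yield rec["domain"]

clean = list(load_clean_domains(sample.splitlines()))
```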

------
graue
This is great, but I'm confused by the part about avoiding porn:

> _Common Crawl will use blekko’s metadata to improve its crawl quality, while
> avoiding webspam, porn, and the influence of excessive SEO (search engine
> optimization)_

Why avoid porn? Millions of people deliberately search for porn on the
internet every day. It's hardly less worthy of crawling than any other
content. The next sentence goes on to suggest that porn is not "useful to
humans", which is obviously false.

If Common Crawl is indeed filtering out content they determine to be
pornographic, I hope they are taking care not to also remove information on
sexual and relationship health and LGBT rights, which are often collateral
damage of porn-blocking systems. And it would be nice to see an open
acknowledgement that filtering is going on - I couldn't find any references to
this at commoncrawl.org.

~~~
randomstring
I work at Blekko and am the primary engineer working on our porn tagger. We
include LGBT, reproductive/sexual health, breast cancer, bands like "Pussycat
Riot," etc in our training set to make sure these sites do not get hidden from
our search results.

We do not have anything against porn. However, when people are not searching
for porn, showing them porn results makes for a bad search experience. So
identifying porn, and only showing it for porn-relevant queries, is vitally
important to search quality.

~~~
jacquesm
So tag it but include it.

Your answer is a bit at odds with
<http://news.ycombinator.com/item?id=4933437>

~~~
ChuckMcM
Jacques, the porn is there, it's just identified as such. Whether or not it is
included in results is a function of the query.

One of the funny things about language is that there is always a 'pun' or an
innuendo which can trigger a hit on a porn site. However, if most of what
you're looking for isn't porn, then the web site has to assume you are _not_
looking for porn and keep NSFW links from surfacing into your search results.
You could always explicitly ask for it with /porn, but then that is a clear
signal of what you are looking for.

Part of the crawl data includes an indication as to whether or not the ranker
thought the document was 'porn' or 'not porn'. If you're selecting things to
return, you can ignore that bit and mix porn with non-porn: when someone
searches for 'beavers' you get a wider variety of results than you would if
you assumed they meant the furry critters which chew on trees, or sports
teams, and limited results to those documents.

~~~
jacquesm
That's actually really useful.

Having it there but tagged is halfway towards being able to filter it out. Not
having it means that when you merge it with another set you're not going to be
able to remove the porn.

And it also allows you to use it as a training set for classifiers.

~~~
ChuckMcM
"And it also allows you to use it as a training set for classifiers."

One could imagine a project on Common Crawl which auto-generated a list of
slang terms for porny things by creating a list of n-grams from the words used
in documents tagged as porn.
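The project imagined above could start as simply as counting which n-grams are disproportionately frequent in porn-tagged documents versus the rest. A toy sketch (not an actual Common Crawl job; the smoothing and scoring are the simplest things that could work):

```python
from collections import Counter

def ngrams(text, n=2):
    """Split text into overlapping word n-grams."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def distinctive_ngrams(tagged_docs, n=2, top=10):
    """tagged_docs: list of (text, is_porn) pairs. Return the n-grams whose
    frequency in porn-tagged docs most exceeds their frequency elsewhere."""
    porn, clean = Counter(), Counter()
    for text, is_porn in tagged_docs:
        (porn if is_porn else clean).update(ngrams(text, n))
    # add-one smoothed ratio of porn count to non-porn count
    scored = {g: (porn[g] + 1) / (clean[g] + 1) for g in porn}
    return sorted(scored, key=scored.get, reverse=True)[:top]
```

At Common Crawl scale you'd run the counting as a map-reduce over the tagged documents rather than in memory, but the scoring idea is the same.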

------
ot
This is going to be invaluable for information retrieval researchers.

Google, MSR, and Yahoo! have an edge over universities in research because of
the large amount of data they collect from their users; all the other
institutions are left with either small benchmark datasets or synthetic data,
which are usually not representative of actual usage scenarios. I myself had
to synthesize a query log from the Wikipedia request logs to test some of my
data structures on large-scale data.

I expect to see a huge number of papers using this data in their experiments
in the immediate future. Thanks, Blekko!

~~~
robrenaud
As far as I can tell, this contribution from Blekko doesn't have any user
data/queries in it.

As far as I can tell, this is the best resource for publicly available search
engine query logs.

<http://jeffhuang.com/search_query_logs.html>

I don't intend to downplay the contribution, having a large collection of
spam/porn classified web docs is still a very nice thing to have for
researchers.

~~~
greglindahl
This is our first donation. We have a lot more we plan on giving, but for user
queries, for example, the privacy issues are a lot more difficult to work
through. We have no interest in being the next privacy scandal.

------
LisaG
I am part of Common Crawl and I just wanted to say that we are super excited
about blekko's donation! This is yet another demonstration of how much blekko
values openness and transparency.

------
greglindahl
If you'd like to see some examples of what you can do with Common Crawl data,
here are the winning projects from a code contest held last September:

[http://commoncrawl.org/announcing-the-winners-of-the-code-co...](http://commoncrawl.org/announcing-the-winners-of-the-code-contest/)

Some code libraries for using Common Crawl data:

<https://github.com/commoncrawl/>

Some clues for getting started:

<http://commoncrawl.org/get-started/>

~~~
chiph
This is a pretty cool thing that you're doing, Greg.

------
mtgx
Last I checked, Blekko was pretty good, better than DuckDuckGo in search
relevancy I thought, but that was like a year ago.

~~~
NonEUCitizen
Does Blekko also NOT track you, like DuckDuckGo ?

~~~
sp332
The terms seem pretty reasonable, which means they are not as extreme as
DDG's. <https://blekko.com/about/privacy-policy>

------
nell
I was at the open data meetup in Mozilla where Common Crawl presented and
Blekko's CTO was present. Little did I know that great stuff like this was in
the making.

------
ssalevan
Common Crawl has a really neat mission, as there isn't a whole lot of free and
open data out in the world right now and they're trying to change that. With
this donation it looks like their commons will be augmented with some great
stuff and that can only mean awesome things.

------
kjackson2012
I've tried Blekko a couple of times, and thought they were okay, but I don't
find them enough better than Google to make me switch.

I'm curious how Blekko can stay in business. Do they get enough traffic and
revenue from ads, etc., to maintain some sort of positive cash flow, or are
they simply burning through cash from investors?

~~~
frederi
For me, blekko doesn't have to be "so much better" to make me use it over
Google; it just has to be as good or slightly better, because I respect and
support blekko's philosophy. If two products were of equal quality, wouldn't
you rather use the one made by a company that shares your values?

~~~
LisaG
Strongly agree! blekko's Bill of Rights is a great expression of their values
and of why we should all be using blekko.

blekko Bill of Rights

1. Search shall be open

2. Search results shall involve people

3. Ranking data shall not be kept secret

4. Web data shall be readily available

5. There is no one-size-fits-all for search

6. Advanced search shall be accessible

7. Search engine tools shall be open to all

8. Search & community go hand-in-hand

9. Spam does not belong in search results

10. Privacy of searchers shall not be violated

~~~
boyter
Although to be fair, you have to pay to get the SEO/Ranking data now. I'm cool
with that if it keeps the service open but it should be pointed out.

On a related note www.procog.com has a totally open algorithm.

------
chuhnk
I wondered how Google's foothold in search would ever be overcome. I think
this might just be the start of something.

~~~
iroy
Google has had years to tweak and tune things. I think it is practically
impossible for anyone to match them, let alone surpass them in index quality.

Google needs to open its index and create a search market place. There are
millions of domain/location specific apps that can be built around that index.

