If we get a good signal from this extension, or from offering block links in Google's search results, then it's much more similar to Gmail's spam algorithm, where an email is labelled as spam partly because a lot of users say it is, rather than because of some editorial decision on our part.
For example, sometimes we see copies rank higher than originals. Why does that happen? Google knows where it first saw a particular piece of content, doesn't it? Why doesn't it use that as a heavy ranking factor?
Or am I too far off?
Now which is the original from Google's point of view? Smaller blogs can take significantly longer to index than sites with massive amounts of content moving about daily.
My small, negligible personal website notifies Google, Bing and Yahoo immediately and automatically as soon as I publish something new. It also publishes a feed. Even if the content is picked up and republished right away by a site that is indexed every minute, it should be possible to correctly determine the original publisher.
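The notify-on-publish step can be sketched roughly like this. The GET "ping" endpoints below are assumptions based on the endpoints Google and Bing historically documented (both have since been deprecated), so treat this as an illustration of the mechanism rather than a current recipe:

```python
import urllib.parse
import urllib.request

# Assumed endpoints: these match the historically documented sitemap
# ping URLs, which the engines have since retired.
PING_TEMPLATES = [
    "https://www.google.com/ping?sitemap={}",
    "https://www.bing.com/ping?sitemap={}",
]

def build_ping_urls(sitemap_url: str) -> list[str]:
    """Build one notification URL per engine for an updated sitemap."""
    encoded = urllib.parse.quote(sitemap_url, safe="")
    return [template.format(encoded) for template in PING_TEMPLATES]

def ping_search_engines(sitemap_url: str) -> None:
    """Fire-and-forget GET to each engine right after publishing."""
    for url in build_ping_urls(sitemap_url):
        try:
            urllib.request.urlopen(url, timeout=10)
        except OSError:
            pass  # best effort; a failed ping is not fatal
```

The point is only that the notification is automatic and immediate, so the engines have a timestamped first sighting of the content.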
In some cases, I can think of more ways to determine the original publisher. And certainly Google can think of even more.
Please share them. Thanks.
I publish an article at only-original-content.com. The article has some images that are served from only-original-content.com/images. Now only-copied-content.com takes my original article and republishes it. Since only-copied-content simply copied the HTML, the images are still served from only-original-content.com/images.
In that case it should be simple to determine who is the original publisher. Of course, only-original-content.com could simply be a CDN that only-copied-content.com uses for its static resources, but, again, it should be easy to determine whether that is the case.
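A minimal sketch of that heuristic, assuming you already have the page URL and the `src` attributes of its images extracted (the function name and structure here are illustrative, not any actual Google mechanism): if most images on a page are served from a single foreign domain, that domain is a candidate for the true source, since a copier who pastes the HTML verbatim leaves the `<img>` tags pointing at the original host.

```python
from urllib.parse import urlparse

def likely_original_publisher(page_url: str, image_urls: list[str]) -> str:
    """Guess the original publisher of an article from where its images
    are hosted. Returns the page's own host if the images are
    self-hosted, otherwise the foreign host serving the most images."""
    page_host = urlparse(page_url).netloc
    foreign_counts: dict[str, int] = {}
    for src in image_urls:
        host = urlparse(src).netloc
        if host and host != page_host:
            foreign_counts[host] = foreign_counts.get(host, 0) + 1
    if not foreign_counts:
        # All images self-hosted: this page may well be the original.
        return page_host
    # The foreign host serving the most images is the likely original.
    return max(foreign_counts, key=foreign_counts.get)
```

As noted above, a CDN would also show up as a "foreign" host, so a real system would need an allowlist or a check for known CDN domains before trusting this signal.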
That said, if you wanted to share some examples where you're seeing copies rank higher than originals I'm happy to pass that on to the right folks. In fact, some of the right folks are already on this thread. :)
Personally, I'm glad that you're putting this out as a user-controlled thing. I like the fact that I'm able to get rid of results that aren't necessarily spam or SEO'd garbage, but where I still know that I never want to see results from that site again.
However, if it is built into the general results, could you also add a metric to Webmaster Tools showing the frequency with which your domain is reported? It would also be good if the blocking could have a timeout period, so that sites can be given a chance to improve their behavior rather than being deleted from your results forever.