I alluded to that in http://news.ycombinator.com/item?id=2218627 . People feel comfortable with Google removing blatant spam: hidden text, cloaking, sneaky JavaScript redirects, etc. People tend to feel less comfortable if they feel like Google is making an editorial decision.
If we get a good signal from this extension, or from offering block links in Google's search results, then it's much more similar to Gmail's spam algorithm, where an email is labelled as spam partially because a lot of users say it is, rather than because of some editorial decision on our part.
Matt, before we get to the point where potentially controversial editorial decisions will have to be made, I would imagine there are things that could be done automatically and uncontroversially.
For example, sometimes we see copies rank higher than originals. Why does that happen? Google knows where it first saw a particular piece of content, doesn't it? Why doesn't it use that as a heavy ranking factor?
Consider this: I make a blog post on my relatively new blog examplesite.com. Techcrunch picks up on the article and immediately reposts it on their site. Which do you reckon would be the first to get indexed?
Now which is the original from Google's point of view? Smaller blogs can take significantly longer to index than sites that have massive amounts of content moving about daily.
I have thought about that, and I don't think it is really a problem.
My small negligible personal website notifies Google, Bing and Yahoo immediately and automatically as soon as I publish something new. It also publishes a feed. Even if the content is picked up and republished right away by a site that is indexed every minute, it should be possible to determine correctly the original publisher.
In some cases, I can think of more ways to determine the original publisher. And certainly Google can think of even more.
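The first-seen idea above can be sketched in a few lines. This is only an illustration: the URLs and timestamps are invented, and a real crawler would of course combine this with many other signals.

```python
from datetime import datetime

def original_publisher(sightings):
    """Given {url: first_seen_datetime} for duplicate copies of one
    piece of content, return the URL the crawler saw first."""
    return min(sightings, key=sightings.get)

# Hypothetical crawl data: the small blog pinged the search engine
# before the big site republished the article.
sightings = {
    "http://examplesite.com/post": datetime(2011, 2, 14, 9, 0),
    "http://techcrunch.com/repost": datetime(2011, 2, 14, 9, 45),
}
print(original_publisher(sightings))  # http://examplesite.com/post
```

Even if the republisher is crawled every minute, the ping and the feed give the engine a first-seen time for the small blog that predates the copy.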
I publish an article at only-original-content.com. The article has some images that are served from only-original-content.com/images. Now only-copied-content.com takes my original article and republishes it. Since only-copied-content simply copied the HTML, the images are still served from only-original-content.com/images.
In that case it should be simple to determine who is the original publisher. Of course, only-original-content.com could simply be a CDN that only-copied-content.com uses for its static resources, but, again, it should be easy to determine whether that is the case.
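A rough sketch of that image-hostname heuristic, using only the Python standard library (the hostnames and the page snippet are made up, and as noted above a CDN would trip this check too, so it could only ever be one signal among several):

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class ImageHostCollector(HTMLParser):
    """Collect the hostnames that a page's <img> tags are served from."""
    def __init__(self):
        super().__init__()
        self.hosts = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src", "")
            host = urlparse(src).netloc
            if host:
                self.hosts.append(host)

def looks_like_copy(page_host, html):
    """Flag a page whose images are ALL served from other hosts --
    a hint (not proof) that the HTML was copied wholesale."""
    parser = ImageHostCollector()
    parser.feed(html)
    foreign = [h for h in parser.hosts if h != page_host]
    return bool(parser.hosts) and len(foreign) == len(parser.hosts)

copied_page = (
    '<p>Some article text...</p>'
    '<img src="http://only-original-content.com/images/a.png">'
)
print(looks_like_copy("only-copied-content.com", copied_page))   # True
print(looks_like_copy("only-original-content.com", copied_page)) # False
```

The same check run against the original site returns False, since its images are served from its own host.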
demetris, that's exactly what we improved with a recent algorithm change: http://www.mattcutts.com/blog/algorithm-change-launched/ . As you point out, that was a more straightforward change, and that's why we were able to launch that one first.
That said, if you wanted to share some examples where you're seeing copies rank higher than originals I'm happy to pass that on to the right folks. In fact, some of the right folks are already on this thread. :)
Hmm. That could be because Google Groups has changed its URL structure, which could make crawling it harder. Or because USENET/mailing lists don't always have a centralized/canonical location on the web, which makes dupe content more of a potential issue. It's not the usual "Website X copied my website" scenario.
Getting on a tangent here, but Google has a hard time crawling Google Groups? Have you tried emailing support@google.com? Just kidding but in all seriousness how is it that a Google property has bad SEO?
You already have the SafeSearch filter that can be toggled on and off to show you different search results. Why not an Editorial filter as well (perhaps disabled by default)?
I understand the fine line you have to walk, and I'm glad to see that you guys take it as seriously as you do.
Personally, I'm glad that you're putting this out as a user-controlled thing. I like the fact that I'm able to get rid of results that aren't necessarily spam or SEO'ed garbage, but where I still know that I never want to see results from that site again.
However, if it is built in to the general results, could you also add a metric to Webmaster Tools showing the frequency with which your domain is reported? It would also be good if the blocking could have a timeout period, so that sites can be given a chance to improve their behavior rather than having that domain buried forever.