I'm not saying that a clone will never be listed above SO, but it definitely happens less often compared to several weeks ago.
This happens for more than Stack Overflow clones. Mailing lists, Linux man pages, FAQs, published Linux articles, etc. all have clone pages that are obvious link farms (sometimes they even include ads that attempt to harm my computer), and these rank higher than the "official" (or at least less noisy) pages.
Ideally, I'd like to completely remove domains from my results, as has been discussed elsewhere on HN. Hopefully this upcoming push for social networking that Google has will reintroduce a better-implemented "SearchWiki" feature...
> I've been tracking how often this happens over the last month.
> it definitely happens less often compared to several weeks
> I am seeing many, many more clone sites in my search
> results in the last few months
Result #4 at the moment is "AWS Developer Forums: Interactions between S3, EMR and HDFS ..." on http://www.hackzq8search.appspot.com/developer...com/...
What's sublime about this example is that:
1. hackzq8search is a clone of AWS's websites amazonwebservices.com, aws.typepad.com, etc.
2. hackzq8search is hosted on appspot.com, Google's App Engine domain
3. hackzq8search is over quota, so the site doesn't show any content anyway.
Yet this site was the top search result, beating out the site it was cloning, time and time again on my AWS/EMR-related searches this week.
The one mitigating aspect is that hackzq8search's URL naming scheme is easily decodable -- the hackzq8search URL includes the full URL of the cloned page, so I can write a Greasemonkey script to extract the proper original URL.
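A minimal sketch of what that extraction might look like, assuming a hypothetical scheme where the clone simply appends the original host and path after its own hostname (the real hackzq8search encoding may differ):

```javascript
// Hypothetical helper for a Greasemonkey-style script.
// Assumes a clone URL of the form:
//   http://clonesite.appspot.com/<original-host>/<original-path>
// which is NOT the confirmed hackzq8search scheme, just an illustration.
function extractOriginalUrl(cloneUrl) {
  const u = new URL(cloneUrl);
  // Everything after the clone's hostname is the embedded original URL.
  const embedded = u.pathname.replace(/^\//, "");
  if (!embedded) return null; // nothing embedded; give up
  return "http://" + embedded;
}

// In an actual userscript you would walk document.querySelectorAll("a"),
// detect clone-site hrefs, and rewrite each link in place.
```

The real script would also need to handle query strings and whatever escaping the clone applies, but the idea is just string surgery on the result links.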
I found a glimmer of optimism in that the site's SEO success has been slowly fading this week: I complained about https://encrypted.google.com/search?q=aws+s3+security+sox+pc... on Thursday, and by Friday the hackzq8search result was gone from the first search result page.
It's still not hard to slam some AWS-related keywords into Google and get these bogus results, though.
To be clear: the webspam team does reserve the right to take manual action to correct spam problems, and we do. That not only helps Google be responsive, it also improves our algorithms, because we get to use that data to train better algorithms. With Stack Overflow, I especially wanted to see Google tackle this instance with algorithms first.
Sites that are the victims of content cloning tend to be very visible and valuable (that's why they get cloned), so maybe a little manual curating could be worthwhile.
> the Stack Overflow cloners could just make other websites
Not really? The point is not to tag the clones but to tag the original; everything that is not the original and that has copied content is a clone, regardless of its name, domain, or country.
The primary input to search engines comes from web crawlers... the idea of "first" when it comes to duplicated content is already difficult to determine, and (I would guess) it would get much, much worse in the inevitable arms race if something like this were implemented.
Because we don't understand what's hard, we think you're not really trying, and then we make up evil reasons to explain that.
I believe that if people better understood the difficulties of spam fighting, they would be more sympathetic.
Not necessarily. The rate at which Google refreshes its crawl of a site, and how deeply it crawls, depend on how often the site updates and on its PageRank numbers. If a scraper site updates more often and has higher PR than the sites it's scraping, Google will be more likely to find the content there than at its source. Identifying the scraper copy as canonical because it was encountered first would be wrong.
But I'd like to say one other thing. Why is Google only doing something about web spam now, after people have pointed out how bad things have been getting? Has anybody considered creating a small team just to oversee public perception of the search results and try to keep on top of things in the future?