
Ask HN: Has anyone else noticed Stack Overflow clones in Google search results? - rchrd2
Has anyone else noticed Stack Overflow clones in Google search results? They come up frequently for me. I can't help but wonder who's behind these. It can't be hurting Stack Overflow's SEO.

So far I have saved 5 different domains, and it looks like 2 have vanished.

- [dead] http://www.codeitive.com/0izVUjjXVP/selective-foreign-key-usage-in-django-maybe-with-limitchoicesto-argument.html

- [dead] http://www.codedisqus.com/0QmqWVgjgg/hide-label-in-django-admin-fieldset-readonly-field.html

- http://w3facility.org/question/image-servingurl-and-google-storage-blobkey-not-working-on-development-server/

- http://goobbe.com/questions/3109325/how-can-i-disable-a-third-party-api-when-executing-django-unit-tests

- http://www.ciiycode.com/0HyN6eQxgjXP/django-admin-inline-popups

I actually wrote Stack Overflow support about this in April 2015, but so far nothing has changed. Here's the thread:

Me: "Hello,

There are lots of spam results on Google. As a web developer, I am frequently googling for how to resolve some programming issue. Often, I get these spoofing sites that link to Stack Overflow.

I would suggest adding a stricter robots.txt or perhaps blocking some of these bots that are scraping your site.

Here is an example: http://www.ciiycode.com/0HyN6eQxgjXP/django-admin-inline-popups

Thank you."

Response: "Hello,

Thank you for reporting this content. I've passed the information along to the person at our company who handles such issues. It's the diligence of users like you that helps us stay valuable!

Please note, bringing these sites into compliance (or getting them to no longer serve our content) is often a long and arduous process. You may not see immediate results. However, rest assured that we're working on it.

Thank you again,
Stack Exchange Team"

Thoughts?
======
Animats
Google doesn't have a good way to establish provenance, and has trouble
distinguishing copies from originals. It's a common complaint of blog
operators that some bigger blog copied their stuff and got a higher ranking on
Google.

Google could check when it saw something, but that won't work against fast
scrapers. For that, you need trusted timestamps.

One solution to this would be to have a few time-stamping services. You send
in a string, probably a hash, and it adds a timestamp, signs it, and sends
back a signed result. Then provide a WordPress plug-in to use this service,
hashing and time-stamping each blog entry, and putting the result in the HTML
in some standard way. (Perhaps <span signed-provenance-timestamp-hash="xxxxx">
blog entry </span>). A few mutually mistrustful services for that would help;
blogs with serious forgery problems could use multiple time-stamping services.

Search engines then need to look at timestamps as a rating indicator. If two
results are very similar, the earliest one wins.
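
A minimal sketch of what the service half might look like, using an HMAC-signed token for brevity (a real service would sign with a private key, RFC 3161-style, so anyone could verify the token; all names below are illustrative):

    import hashlib
    import hmac
    import time

    # Hypothetical secret held by the time-stamping service. A production
    # service would use a public/private key pair so third parties (like a
    # search engine) could verify tokens without sharing a secret.
    SERVICE_KEY = b"illustrative-service-secret"

    def issue_timestamp(content_hash: str) -> dict:
        """Service side: bind a submitted content hash to the current time."""
        issued_at = int(time.time())
        payload = f"{content_hash}:{issued_at}".encode()
        signature = hmac.new(SERVICE_KEY, payload, hashlib.sha256).hexdigest()
        return {"hash": content_hash, "timestamp": issued_at, "signature": signature}

    def verify_timestamp(token: dict) -> bool:
        """Verifier side: confirm the signature binds the hash to the time."""
        payload = f"{token['hash']}:{token['timestamp']}".encode()
        expected = hmac.new(SERVICE_KEY, payload, hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, token["signature"])

    # Blog side: hash each entry, get it stamped, and embed the token in the
    # page markup (e.g. in an attribute on the element wrapping the entry).
    entry_hash = hashlib.sha256(b"full text of the blog entry").hexdigest()
    token = issue_timestamp(entry_hash)
    assert verify_timestamp(token)

A search engine that trusts the service could then prefer whichever near-duplicate carries the earliest verifiable timestamp.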

~~~
Osmium
Since establishing provenance is such a big problem for Google, perhaps it
would be a good idea for Google to offer a time-stamping service itself?

~~~
dsjoerg
It's not directly a problem for Google — it's primarily a problem for sites
that create original content.

~~~
ezequiel-garzon
Not quite. Google's quality as a search engine could improve dramatically if
true authors could let Google know that content is coming before it's
available anywhere else on the web.

------
jrockway
Stack Exchange makes all the content available as a downloadable file, so they
must be expecting clones.

[https://archive.org/details/stackexchange](https://archive.org/details/stackexchange)

They then have a bunch of licensing terms if you reuse the content, which must
be what they're referring to by "bringing these sites into compliance is often
a long and arduous process".
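
For reference, the dump is a set of XML files (Posts.xml, Users.xml, and so on) with one row element per record. Something like this sketch should stream through it, assuming the attribute names from the dump schema, where PostTypeId 1 marks questions:

    import xml.etree.ElementTree as ET

    def top_questions(path: str, limit: int = 10):
        """Stream Posts.xml and collect the highest-scored questions."""
        results = []
        for _, row in ET.iterparse(path, events=("end",)):
            if row.tag == "row" and row.get("PostTypeId") == "1":
                results.append((int(row.get("Score", "0")), row.get("Title")))
            row.clear()  # release each row's data on a multi-gigabyte file
        return sorted(results, reverse=True)[:limit]

    for score, title in top_questions("Posts.xml"):
        print(score, title)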

~~~
ivanca
They even have a tool/site to query the data:
[https://data.stackexchange.com/stackoverflow/query/new](https://data.stackexchange.com/stackoverflow/query/new)

------
plorkyeran
They've been around for years. A while back a few were sometimes beating SO in
Google results, but Google eventually fixed that. Everything on SO is CC BY-SA
licensed, so as long as the source is attributed (and it is on some of the
sites), it's legal. The motivation is simple: the clone spam sites have ads on
them. They're not actively trying to attack SO; they're just leeching value.

~~~
rchrd2
I guess it's a win-win situation. Stack Overflow's ranking goes up, because of
all the sites referencing it, and the spammers make money.

But as a consequence, the users have to sift through spam results.

~~~
ibejoeb
I don't think so. Those sites redisplay SO content, but they're not linking to
SO. They're actually trying to out-rank them, up their own traffic, and drive
up their ad revenue.

~~~
rhizome
Sometimes they link back, sometimes they don't.

------
kjhughes
See "A site (or scraper) is copying content from Stack Exchange. What do I
do?"

[http://meta.stackexchange.com/q/200177/234215](http://meta.stackexchange.com/q/200177/234215)

------
jeffmould
Sort of related, but over the last year it seems to me that Google's results
in general have been getting steadily worse. Now it seems that at least two
or three spam results are always present on the first page. Most of these
results also happen to be duplicate content from bigger sites, though. I have
tried reporting them to Google on numerous occasions, but I just get the usual
"we will investigate" response and never hear another word.

~~~
Bjartr
In the cases where the information you're looking for doesn't need to be
recent (in my case exercises for a particular sport) you can change the date
range of your search to e.g. only include results from prior to 2005.
Obviously this isn't an option for recent information, but I've found it
useful on occasion to filter out blogspam.

------
dredmorbius
This is common across numerous contexts. I've found what appear to be bots
tweeting my Reddit and HN posts (I don't mind), several Reddit clones of
various levels of sniffitude, Google+ content harvesting, and some Diaspora
syndication (to be expected), again of various levels of sniffitude.

_To the extent that this simply distributes data around, doesn't claim it
for its own, and credits the source, I'm OK with this._ Better even if it follows
site-specific licensing. Among my visions for the Web would be content
syndication where such schemes would actually _directly_ benefit authors and
creators, _regardless_ of where their content is served.

------
fiatjaf
I don't understand why Google can't figure it out and remove these clones.
They can do much harder things. Why couldn't they downrank sites based on
identical text content or, much better, a huge presence of ads?

~~~
TeMPOraL
I suppose it's mostly a social/legal problem. Google could get rid of 90% of
those scrapers by basically hardcoding a preference for StackOverflow whenever
it has the same content as another site, and likewise for sites like
Wikipedia. But then obviously people will cry foul (the scrapers and SEO
people will probably complain the loudest). So whatever solution Google ships
has to be general enough that people won't call it unfair, and that makes the
problem much more difficult.
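
The duplicate detection itself is the easy half. A crude sketch of the kind of check involved, using word-shingle Jaccard similarity plus exactly the kind of hardcoded whitelist people would object to (real engines use scalable variants like SimHash or MinHash):

    # Crude near-duplicate check: two pages sharing most of their k-word
    # shingles are treated as copies of the same content.
    def shingles(text: str, k: int = 5) -> set:
        words = text.lower().split()
        return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 0))}

    def jaccard(a: set, b: set) -> float:
        return len(a & b) / len(a | b) if a and b else 0.0

    # The socially contentious part: a hardcoded list of "origin" sites
    # that win any tie against an unknown domain.
    PREFERRED_ORIGINS = {"stackoverflow.com", "en.wikipedia.org"}

    def pick_winner(domain_a, text_a, domain_b, text_b, threshold=0.8):
        if jaccard(shingles(text_a), shingles(text_b)) < threshold:
            return None  # not duplicates; rank both normally
        for domain in (domain_a, domain_b):
            if domain in PREFERRED_ORIGINS:
                return domain
        return None  # neither is whitelisted; provenance stays unresolved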

------
inguinalhernia
Scraper spam exists for virtually every website that is even somewhat popular;
there is almost nothing site owners can do about it but rely on Google, Bing,
DuckDuckGo, etc., to figure out the originator. Google _should_ be able to
figure it out and rank the true origin source appropriately, and usually
Google is pretty effective at that. But sometimes they're not.

What I find ironic is that many of the user "answers" on StackOverflow are
basically clone spam themselves, copy/pasted from other websites by some user
of the site, usually without sourcing the origin. I have personally found my
own unique solutions and code copied verbatim and pasted into answers on the
StackExchange network multiple times, outranking my original work (without a
reference to the original, of course), and I'm sure others have experienced
something similar. Perhaps that's what you get with a user-generated site;
Wikipedia probably experiences something similar.

Relatedly, some of you may recall that a few years back StackOverflow
basically complained to Google in a public fashion about not ranking well
enough and got a boost from them, whereas the average Joe with an average
website obviously has no such option or recourse. Here was the discussion on HN:
[https://news.ycombinator.com/item?id=2152286](https://news.ycombinator.com/item?id=2152286)

------
Phil_Latio
When I look at w3facility.org, it seems like Google's algorithms do not
properly handle the case where a scraper site provides a source link to the
original content.

Google actually recommends such source links to protect against
duplicate-content penalties when you use external content to enrich your site
(for example, a short section from Wikipedia, IMDb actor info, etc.).
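
Assuming the source link takes the form of a cross-domain rel=canonical tag (one mechanism Google has documented for syndicated content), a crawler can read the claimed original straight out of the copy's markup; the page and URL in this sketch are made up:

    from html.parser import HTMLParser

    class CanonicalFinder(HTMLParser):
        """Pull the href out of any <link rel="canonical"> tag."""
        def __init__(self):
            super().__init__()
            self.canonical = None

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "link" and attrs.get("rel") == "canonical":
                self.canonical = attrs.get("href")

    page = ('<html><head><link rel="canonical" '
            'href="https://stackoverflow.com/q/12345"></head></html>')
    finder = CanonicalFinder()
    finder.feed(page)
    print(finder.canonical)  # the URL this copy claims as its original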

------
ConceptJunkie
I haven't seen this yet, and I pretty frequently run searches that turn up
StackOverflow.com hits... and I make a point of choosing those first, because
they are usually the best results.

However, a few days ago, I did see a single hit for expertsexchange.com for
the first time in years, at least that I noticed. Back when Google used to let
you blacklist domains from your search results, before StackExchange was
around (or when it was still very new), I used to block them, and I'd
eventually assumed they'd gone away. But maybe not.

I hope StackOverflow and/or Google are able to put the kibosh on this kind of
thing, because despite the complaints, SO is still a huge and valuable
resource.

------
aaron695
> I would suggest adding a stricter robots.txt or perhaps blocking some of
> these bots that are scraping your site.

The only people who care about robots.txt are some of the big companies. Even
Baidu ignores it (as they can; it's purely there as etiquette).

Blocking bots is hard.
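
Which is the point: robots.txt is just a text file that a crawler voluntarily consults. Python even ships a parser for it in the standard library; a polite bot (name made up here) does this check, while a scraper simply skips it:

    from urllib import robotparser

    # A well-behaved crawler checks robots.txt before fetching a page.
    # Nothing enforces this step; ignoring it is trivial.
    rp = robotparser.RobotFileParser()
    rp.set_url("https://stackoverflow.com/robots.txt")
    rp.read()

    url = "https://stackoverflow.com/questions/12345"  # illustrative URL
    if rp.can_fetch("PoliteBot", url):
        print("robots.txt permits fetching", url)
    else:
        print("disallowed; a scraper would just fetch it anyway")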

------
douche
I see this happen a lot with the MSDN forums too. The interesting thing there
is that some of the mirror sites are still carrying topics that have been
deleted or otherwise disappeared from the real MSDN forums. More than once, in
the obscure subset of Microsoft tooling that I work in, the only hits that are
still alive are on somewhat suspicious looking .ru sites, so in some sense, I
am glad these sites do exist - otherwise, I'd be completely SOL trying to
figure out why the badly-documented API I'm relying on is barfing up an opaque
HRESULT.

------
onion2k
It's not just StackExchange. I noticed the other day that there's a Twitter
account and website called "@explodingAds" that tweets HN user comments and
mirrors them on its website. I imagine taking content verbatim is just a quick
way to build a corpus of search-indexed pages that generate page views and ad
impressions.

------
nmbdesign
Have been noticing this for the last few months too, very weird.

------
volaski
I don't understand why StackOverflow allows this. Yeah, Creative Commons is
cool and all, but it IS NOT COOL for actual users. I get so annoyed every time
I search for something on Google and it leads to one of its clone sites. It's
not like StackOverflow has better search than Google (which is ridiculous). I
still have to search on Google instead of StackOverflow if I want quality
search results. With more power comes more responsibility. StackOverflow's
policy feels too irresponsible in this regard.

~~~
JasonPunyon
We use that license because it protects the content from us. No matter who
comes along to run Stack Overflow in the future, Stack Overflow can't do
something like put up a paywall and lock it up. Someone else'll just be able
to host a copy.

~~~
volaski
As an end user, I don't care about that at all. If StackOverflow does
something like put up a paywall and lock it up, then it will die off and some
other site will rise to replace it, just like the sites that came before
StackOverflow, put up paywalls, and faded away. Also, StackOverflow can always
change the policy if it wants (which probably won't happen, for the reason I
mentioned), so the license as an excuse doesn't really make sense to me,
especially when it comes at the cost of a horrible user experience. Lastly, it
doesn't seem like StackOverflow is doing much to improve search on the site
itself, and that makes this even worse. I wouldn't be complaining if I could
find more relevant StackOverflow results on StackOverflow than by searching on
Google. How is it that I can find more relevant results on a generic search
engine than on the site the content came from?

~~~
volaski
Wow, I'm getting downvoted like crazy. I don't think I said anything that's
not factual. At least explain why you think I am wrong if you're gonna
downvote. To be clear, I love StackOverflow, and I don't know what the world
would have been like if it wasn't around, but I do think there are things that
are broken, and I just mentioned them. Am I supposed to keep quiet because
that's how it's been?

~~~
ciupicri
> If StackOverflow does something like put up a paywall and lock it up, then
> it will die off and some other site will arise that will replace it

As others have already said, that's possible thanks to their liberal license,
not despite it, so you're wrong, hence the downvotes (though I haven't voted
either way).

~~~
volaski
As a thought experiment: do you think that if StackOverflow changed its
license tomorrow (hypothetically, because of the spam problem) but assured all
its users that it would never mess with them, all the users would leave simply
because of the license?

------
wyclif
There are also a lot of Trello clones.

