
Matt Cutts is looking for scraper sites - adamlj
https://twitter.com/mattcutts/status/439122708157435904
======
spindritf
It's a funny quip but it's getting more attention than the important piece of
news it highlights: Google is finally doing something about scraper sites
ranking higher in search results than the original creators. Good.

Many people don't write for money, to put ads on their website, or as part of
some "content marketing" campaign. All they want is a little recognition. A
boost in positioning on the SERP means we will be getting useful stuff at no
cost.

And there are genuine replies there. Ryan Jones[1] even got the scrapers to
confess their sins[2].

[1]
[https://twitter.com/RyanJones/status/439123533349015553](https://twitter.com/RyanJones/status/439123533349015553)

[2]
[https://www.google.com/search?q=%20%22istwfn%22+%22stole+thi...](https://www.google.com/search?q=%20%22istwfn%22+%22stole+this+word+from+noslang%22)

~~~
leephillips
"Google is finally doing something about scraping"

I hope this is genuine and not a disingenuous diversion on Google's part. The
fact that the Huffington Post still ranks very high for trendy searches makes
me wonder.

As usual, follow the money: the scraping sites exist to make money, often
through Google's advertising; Google gets a cut. The original content is often
on sites with no advertising or real traffic, from which Google profits
nothing.

EDIT: To expand on this: Google-search for any hot topic in the news, say the
name of some misbehaving pop star. See the HuffPo result near the top of the
page. Look down to see several results from real newspapers. This is where the
original content can be found. Most of these newspapers are about to die
because they're not making any money. HuffPo investors are filthy rich because
they're gaming the search engines to profit from copy-and-paste.

ANOTHER EDIT: I apologize for my characterization of the Huffington Post. I
was describing, accurately, the nature of that site as it was the last time I
visited it some time before its purchase by AOL three years ago. The HuffPo I
see today is utterly transformed. They use wire services, do plenty of their
own reporting, and many of the links on the front page go directly to other
news sites. They are no longer a copy-and-paste site.

~~~
acjohnson55
Huffington Post isn't a scraper site. Aside from the original content they
produce, they republish blog posts with permission from the authors. If you
have an example of Huffington Post literally cut-and-pasting content from
someone without attribution, please share.

I also assume that by "HuffPo investors" you mean AOL? Huffington Post is a
fully owned subsidiary.

(Disclosure: I consult for Huffington Post)

~~~
leephillips
You are right. Please see my second edit in my comment.

------
VikingCoder
Scrapers lift the full content, wholesale, without attribution.

You may as well just point at [http://images.google.com](http://images.google.com)
and complain that it's scraping. Or
[http://news.google.com](http://news.google.com).

In general, do you think Wikipedia gets more traffic because Google exists, or
do you think Google gets more traffic because Wikipedia exists? Meaning, which
effect is larger? I'm pretty sure the answer is obvious.

And if more scrapers donated millions to the site they scrape from, the world
would be a much better place.

[http://wikimediafoundation.org/wiki/Press_releases/Wikimedia...](http://wikimediafoundation.org/wiki/Press_releases/Wikimedia_Foundation_announces_$2_million_grant_from_Google)

~~~
danielbarla
That's one way of looking at it. On the other hand, they link to the original
URL, passing traffic back to the original source. Most "scraper" sites take
the content, wrap it in their own similar outer layer, and try to take the ad
revenue. E.g. I've seen my own StackOverflow answers copied, word for word, to
a scraper site and presented under a made-up name.

~~~
dangrossman
StackOverflow actually allows this; all their data is Creative Commons
licensed, and they publish the full database dump on the Internet Archive.

[https://archive.org/details/stackexchange](https://archive.org/details/stackexchange)
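The published dump is easy to work with programmatically: each table is one flat XML file of `<row>` elements whose attributes (`Id`, `PostTypeId`, `Score`, ...) follow the dump schema. A minimal streaming-parse sketch, run here against an invented stand-in for `Posts.xml` so it's self-contained:

```python
import xml.etree.ElementTree as ET
from io import StringIO

# Tiny stand-in for the Posts.xml file inside a Stack Exchange dump.
# Attribute names follow the published dump schema; the data is made up.
SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="2" Score="42" Body="&lt;p&gt;Use iterparse.&lt;/p&gt;" />
  <row Id="2" PostTypeId="1" Score="7" Body="&lt;p&gt;How do I parse big XML?&lt;/p&gt;" />
</posts>
"""

def answers(xml_file):
    """Yield (id, score) for answer rows (PostTypeId 2), streaming so the
    multi-gigabyte real file never has to fit in memory."""
    for _, elem in ET.iterparse(xml_file):
        if elem.tag == "row" and elem.get("PostTypeId") == "2":
            yield int(elem.get("Id")), int(elem.get("Score"))
        elem.clear()  # free each element after we've looked at it

print(list(answers(StringIO(SAMPLE))))  # → [(1, 42)]
```

Swap the `StringIO` for an open file handle on the real `Posts.xml` and the same generator works unchanged.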

~~~
jbinto
Do the terms of the license allow for this kind of abuse?

Just because something is CC doesn't mean you can do whatever you want with
it.

~~~
dangrossman
Yes, they do; it's not abuse when you're given explicit permission. CC BY-SA
means you can do whatever you want with it as long as you attribute the source
as specified.

~~~
leephillips
"as long as you attribute the source"

danielbarla said that they presented the material under a false name; this
goes beyond copying and becomes plagiarism, which I can't imagine is an
intended result of the CC license.

~~~
aroch
Is the source 'User X' or 'StackOverflow'? When you reference CC BY-SA code
you don't reference the people who, say, checked it into git but rather the
whole repo.

------
xuki
This is pretty funny
[https://twitter.com/danbarker/status/439125570115223552](https://twitter.com/danbarker/status/439125570115223552)

~~~
nmeofthestate
Hah yep. See also this:
[https://news.ycombinator.com/item?id=7318203](https://news.ycombinator.com/item?id=7318203)

~~~
nissehulth
ARGH! INFINITE RECURSION!

------
jjoonathan
Bah, what would I possibly need with a scraped definition that

1) Hasn't been chunked into 20 pieces of varying grammatical structure which
are automatically matched to corresponding questions

2) Hasn't been subsequently pasted over a slideshow of completely irrelevant
stock photos in bold, white font

3) Isn't accompanied by a grid of ~30 vaguely related questions helpfully
linked to similar pages and tastefully decorated with more irrelevant stock
photos

4) Only occupies ~1.5 rather than 3 or 4 of the front page search results

5) Contains only closely related textual ads rather than a melange of casino,
fast food, and online college banners

6) Has fewer than 25 trustworthy stock faces smiling back at me from any given
scroll position

If this is the best Google can do, then I don't think wiki.answers.com has
anything to fear.

------------

Seriously, how the hell does wiki.answers.com manage to pollute half of the
searches I make with their algorithmically generated garbage (multiple times,
at that)?! What kind of SEO catapulted them to the top despite 0 viewer
retention and what surely must be about 0 reputable backlinks? How haven't
they been sent to the 1000th page with manual penalties already? They show up
before wikipedia itself, for crying out loud!

Google, if you aren't going to let users maintain a manual blacklist, you need
to be on top of this kind of thing. It's seriously degrading my search
experience and I suspect I'm not alone. This kind of inattention is the type
of thing that can push even the most inattentive users to change default
search engines.

~~~
robryk
If you use Chrome, there is a "Personal Blacklist" extension that does
essentially what the manual blacklist used to do.

~~~
6cxs2hd6
"If you use Chrome" is Internet Explorer ActiveX controls redux

------
pud
Wikipedia's database is public and used by Google with permission. You can
probably use it for your projects, too.

So this is neither scraping, nor against the rules.

Here are dumps in SQL and XML format:

[http://dumps.wikimedia.org/enwiki/](http://dumps.wikimedia.org/enwiki/)

Ps- Yes the original post was meant to be funny and it was; I do have a sense
of humor. :)

~~~
solve
He's talking about outranking the _true_ original source of the content in
search results. You most certainly cannot create your own site that consists
only of excerpts from Wikipedia, if you wish to remain on Google's search
results. Copyrights are irrelevant to this.

What's bad, though, is that Google isn't just lowering the rankings of non-
original content pages now (including any kind of legitimate curation site).
They're marking the entire domains of new curation sites as "pure spam",
de-listing them from Google entirely, and punishing anyone who's linked to
them.

This is having the effect of sending a clear message to developers -- stay far
away from Google's territory of recommending third party content to people, no
matter how you do it.

~~~
pjc50
Could you show an example of such a legitimate new curation site please?

~~~
dreamfactory2
Not new but a legitimate curation site -
[http://hypem.com/](http://hypem.com/)

------
_wmd
Cue damage control explaining the indiscernibly subtle difference between what
Google does and what these evil, spammy scraper sites are doing

~~~
mkr-hn
The difference isn't subtle.

There's this site called "News360" that sends a lot of traffic to my site
every time I post something. It copies the post in full. Apparently it's a
popular app for iPhone and Google Play. This is an aggregator.

Google copies my site so it can send people to my writing. This is a search
engine.

Then there's the legion of sites that copy my stuff and send no traffic even
though they link back. Most of these are scrapers, meaning they're ad-swill
garbage dumps that get no traffic after Google's recent algorithm updates,
but some are attempts to build new aggregators like the Huffington Post or
that News360 thing.

The scrapers are a nuisance, but don't harm me in any way. Google is free,
relevant traffic. Aggregators find an audience and provide useful content to
them with credit, probably using the RSS feed I publish for that purpose.

------
level09
Google is taking cognitive dissonance to a new level: it's okay for them to
scrape every single site, download its content and images, cache it on their
servers, and run their ad platform on top of it. But that's not enough; they
would still like to impose their rules and punish people who do the same
thing.

~~~
eli
I'll concede that there's probably a middle ground that's kinda gray, but are
you really going to defend bona fide scraper sites? Like ones that simply grab
all the text off some other site and repost it, adding no value? Google is
obviously adding value by providing snippets from Wikipedia.

------
fear91
Google seems less and less connected to reality the bigger they grow.

It's a shame that search engine market share isn't split evenly among several
different engines. I think it would be beneficial both to users and to website
owners. Right now everyone tries to court Google, and they seem to do whatever
the fuck they want.

~~~
jerf
It's also worth remembering they're under continuous, distributed assault by
human-intelligent agents (at least to a first approximation) trying to game
them specifically. The miracle is that Google works at all.

------
smoyer
There are lots of places where Google decides to "help" me, but sometimes I
just want search results. Other times, I actually like getting the curated
content (e.g. search for "delta 3810"). Is there a way to disable this?

EDIT: I should also note that I'm one of those who switched over to DuckDuckGo
for privacy reasons, so I don't see these results as often now.

~~~
7952
The content they do provide is often so bad it is almost embarrassing. Search
for "Russia" and you get a completely useless map and a list of random facts.
It may be useful to a child researching geography but for me it is just
annoying.

I want content that is curated by people who actually understand the subject.
I would pay for a search engine designed by someone who understands my
industry. The Google algorithm only manages to grab at the low hanging fruit.
I am a professional working on real stuff, I want something better than coffee
shop suggestions.

~~~
josefresco
I'm curious, what _should they show you_ if you type in "Russia"? The term by
itself seems pretty generic and open-ended.

"I would pay for a search engine designed by someone who understands my
industry"

What industry is that? And how would Google guess or know your industry unless
you tell them?

------
k-mcgrady
I'd love to see a response from Matt to that. If they think the Wikipedia
article is the most important result, and they will scrape it and put it at
the top, why not just put the Wikipedia article as the top link and leave out
the Google box?

------
300bps
I wonder how Google chooses which Wikipedia articles they scrape and which
ones they don't.

In testing, they definitely don't seem to scrape every article:

[http://i.imgur.com/ujDqZhB.png](http://i.imgur.com/ujDqZhB.png)

~~~
danso
This is a good question... I've long surmised that Google has a set of
heuristics for every site with an API that allows for easy domain-specific
ranking. With Wikipedia, you have the number of page edits, frequency of page
edits, and (to an extent) quality of recent page edits. StackOverflow provides
an even easier metric for what's considered high quality, and Google appears
to apply its own layer on top of that (in my non-scientific perception,
looking something up via Google is almost always more fruitful on the first
search than going directly to SO).

------
habosa
Hey guys, you know this is meant to be humorous, right? I honestly can't
believe that people here are saying Google is a scraper site and complaining
about "hypocrisy". No more caching! When I search Google I want them to
freshly crawl the web and get back to me in a day or two with my results.

</rant>

------
ITB
Google is most certainly crossing the line here.

1. They are not only doing this with Wikipedia, but with many, many sites:
"what is the smallest cell in the human body", "what is the biggest planet in
the solar system".

2. The sites they choose to link are not always the highest-quality sites, as
the two examples above show. Why are these websites being featured?

3. Many times, the user will get their answer right then and there and be done
with the search process; the site misses a visitor. Even though these types of
questions concern "facts", someone took the time to organize and give context
to those facts. Turning facts into useful, consumable content costs money.
Google should not be taking visitors away from these sites.

4. There should be public information on the CTR of these snippets, to see
whether they help or hurt the user.

5. Google is abusing its power as a major search engine to enforce structuring
rules, such as microformats. With these rules, webmasters are giving more and
more semantic meaning to their content, which means Google has an easier time
completing its knowledge graph. They might link to the source site for a
while, but there is no good argument for linking back to Wikipedia to
attribute the fact that Jupiter is the largest planet, since it's a fact, just
like 2+2 being 4 (no attribution).

6. Google is all about ML/NLP/AI-driven knowledge. But in reality they are
turning all of the internet's content creators into a giant sweatshop for
their knowledge graph. This is not fair, and sooner or later it will come back
to bite them.
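To make point 5 concrete, here's a hedged sketch of the kind of structured markup in question: schema.org's Question/Answer vocabulary serialized as JSON-LD, one of the formats Google's structured-data tooling accepts. The question text comes from the examples above; everything else is illustrative:

```python
import json

# A page stating a fact machine-readably via schema.org vocabulary.
# Once the answer is marked up like this, lifting it straight into a
# result box requires no real NLP at all.
fact = {
    "@context": "https://schema.org",
    "@type": "Question",
    "name": "What is the biggest planet in the solar system?",
    "acceptedAnswer": {
        "@type": "Answer",
        "text": "Jupiter is the largest planet in the solar system.",
    },
}

# JSON-LD is embedded in the page inside a script tag.
snippet = '<script type="application/ld+json">\n%s\n</script>' % json.dumps(fact, indent=2)
print(snippet)
```

Webmasters add this markup hoping for richer search listings; the trade-off described above is that it also hands Google the answer itself.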

~~~
Shooti
All of your arguments rest on the underlying assumption that being a
"pure web search engine" is inherently better than their current striving
towards being a "knowledge engine" modeled after the Star Trek computer (of
which web results are just a subset). I'm not sure that can be taken as a
given, if only because the latter presents a much clearer model/metaphor in
the mobile-first technology climate.

------
higherpurpose
Google should be _very_ careful with this. They don't want someone in power to
get the idea like "wait a minute...isn't Google mining whole websites too and
_profiting_ from it? Maybe we should do something about that!"

------
altcognito
Scraper sites usually don't reference the source material, but yeah, you might
want to get some ice for that burn.

------
Angostura
Of course, it's not just Wikipedia these days - movie theatres etc. all
suffer from it.

------
nkuttler
Indeed. Google is constantly doing things they punish other people for.

A happy DDG user, who still uses !g too often though.

~~~
air
[https://duckduckgo.com/?q=scraper+site](https://duckduckgo.com/?q=scraper+site)

~~~
nkuttler
I'm not sure which point you're trying to make. Did you look at op's
submission?

~~~
air
I assumed you thought scraping wikipedia and putting it on top of the search
results was unethical. (The other alternative - that punishing scraper sites
is unethical - seemed unlikely). So the fact that you prefer DDG because of
this seemed weird, considering DDG does the same thing.

~~~
ldng
DDG clearly attributes the content to Wikipedia and doesn't cloak the link
behind a redirect.

------
tobehonest
"All therefore whatsoever they bid you observe, that observe and do; but do
not ye after their works: for they say, and do not." \--Matthew 23:3

"Do as I say, not as I do" \-- Google

------
Grue3
Cue Bing, DuckDuckGo and any other search engine (except Google, of course)
being Google-killed for "scraping". It's the perfect plan!

~~~
solve
Google recently flagged my content-curation startup as "Pure Spam", even
though it only takes small snippets from the original sources, is 100% human
curated, and always links back to the true original source.

Not only are the curated pages blocked, but the entire domain is blocked as
"pure spam". People who use Google to find a domain instead of typing the full
URL now can't find it anywhere.

These assholes are just being anti-competitive now.

~~~
troels
Curious - Which site is that?

~~~
solve
You won't find it on Google :) We're making a products recommendation site,
focusing on goal-oriented decisions. We'll make the full announcement within
the next couple of weeks.

This reminds me. Google is really killing the "release early and release
often" approach, if people now have to do a ton of SEO learning and tweaking
to avoid having their MVP permanently banned on launch day.

~~~
pinakothek
"Google is really killing the "release early and release often" approach, if
people now have to do a ton of SEO learning and tweaking to avoid having
their MVP permanently banned on launch day."

Judging by the downvotes this is not a legitimate concern? Why?

~~~
PaulHoule
Actually, from an SEO perspective, the #1 principle Google follows is
preventing "release often" from being effective for SEO.

Talk to anyone who makes money at PPC and they will tell you one thing. You
make a campaign, measure the results, change it a little, measure the results,
and make incremental improvements to make a profitable campaign.

If you could do that with SEO, SEO would be a lot easier. Google, therefore,
has a number of mechanisms (some patented) that cause all hell to break loose
if you make the kind of changes to your site that you'd use to incrementally
improve its SEO.

It's one of the reasons we are stuck with crappy sites like answers.com,
w3schools, and wrongdiagnosis: once a site like that is successful, the
operators are loath to make any changes lest their rankings drop.

------
ricg
Easy. Search for a programming related question. After the result from
stackoverflow you'll find dozens of scraper sites.

~~~
ntaso
If you read correctly, the task is to report scraper sites that rank HIGHER
than the original site. Not the case in your scenario.

~~~
tobehonest
And in the OP's context, not only does it rank higher, it's above ALL links.
So you don't even need to visit the target page, in effect stealing traffic
(and potential ad revenue) from the source.

------
sebii
More evidence: [http://shadyseo.com/](http://shadyseo.com/)

------
gwu78
I do not understand the Wikipedia definition of "scraper site".

By this definition webcache.googleusercontent.com qualifies.

It is a full copy of every site GoogleBot scrapes.

Google gives attrition to the original source, but if this isn't "scraping",
what is?

They have been sued for this, and they've won. The benefits of a decent search
engine outweigh the burden of infringing the copyrights of others. At least
where Google and other search engines that cache websites are concerned.

~~~
gwu78
s/attrition/attribution/

------
baldfat
Not funny since it is a double post for the same wikipedia:
[http://en.wikipedia.org/wiki/Scraper_site](http://en.wikipedia.org/wiki/Scraper_site)

Seriously that was just a stretch, but they both say the full url. So all of
Google News is a scraper site and any other summery given is a scrapper site
then. Sad.

------
return0
Joking aside, I see these people as the waste collectors of the web. Respect
their work, but I wouldn't want to do it.

------
cousin_it
That SERP should show only one result from Wikipedia instead of two. It should
be on top, have a blue title link to Wikipedia, _and_ look like an answer to
the user's question. That could be done by a general mechanism that lets every
site customize their representation on the SERP, or by a special case for
Wikipedia.

------
bhartzer
Who is deemed to be the scraper? The site that gets crawled and indexed first
and ranks better, or the site that ranks well but whose scraped copies don't
rank as well?

Matt is looking for scrapers that rank better than the original, basically
meaning they have higher PageRank and more links.

------
lazyjones
I would report Google to him, but I'm afraid he's not planning to act
fairly/consistently ...

------
MitziMoto
Google should offer some kind of revenue sharing (Like Youtube) to the sites
it's "stealing" visitors from by showing information directly. And you should
have to opt into it through something like webmaster tools.

------
rip747
Why don't they just integrate this into the results page? What's wrong with
having up and down votes, or a "report this link" button, for the results?

------
globalpanic
I thought this was largely taken care of by Google Panda?

------
motyar
I know one, Google.com

------
iamabraham
Sensational.

------
pearjuice
Ah, a whole thread filled with pseudo-intellectual discussion about what
scraping is (or isn't) due to some silly snarky-joke which Matt is probably
laughing at, too. Hacker News to the rescue!

