
Google penalizes original content site because of scrapers - raphaelb
http://www.seobook.com/no-adwords-soup
======
clintavo
Google's latest algorithmic changes seem to be either horribly wrong or not
fully baked.

Our site went from ranking #8 for our target search "artist websites" to PAGE
440 of the results. Our listing for "how to sell art" just went away. There's
been nothing but original content on our site for 10 years, and, among
artists, we're considered one of the best sources of art marketing
information, given that I owned an art gallery for 20 years and all of our
other writers are professional artists. (and yet Google still has _ehow_
ranked for "how to sell art".....yeah, I'm sure ehow knows a _whole_ lot more
than we do).

I'm saying this not to vent, but to concur with Aaron and others that there is
something wrong. It may not hurt Google's business....the latest algorithms
probably improve adsense revenue, and that's fine, it's their business.
Fortunately I've read HN long enough to know not to build my entire company on
top of someone else's platform and, as much as it upsets me, we don't need
Google. Bing (and Yahoo) have us at #3 for that same search ("artist
websites"). We don't depend on search engines as our only source of marketing
leads....nor even our main source.

The most frustrating thing is not even that it happens, but that they do not
communicate. There's no way to find out WHAT happened. Nothing in Webmaster
Tools. No way to pay for search support. I read the Google blog post with
guidelines on how to structure content after Panda and, none of that applied
to us, at least not that I could tell.

They say "just focus on users" and that's what we do, but I guess that's BS.

I, frankly, think Google's gotten too big for their britches and as unlikely as
it is to happen, I hope Bing, Blekko and yes, DuckDuckGo take some market
share away. Windows is better for having OSX and Linux to compete with. Maybe
Google would be a bit less "evil" with more competition too.

Sorry for the bit of the rant, I'm usually only a lurker here, but this
article of Aaron's really hit close to home this week. At least there are a
couple of relevant points buried in my little rant....I hope ;-)

~~~
moultano
>the latest algorithms probably improve adsense revenue, and that's fine,
it's their business

There's no reason to jump to that conclusion. We don't make ranking changes to
improve adsense revenue, and don't use it as a metric to evaluate ranking
algorithms. We don't even have a mechanism to collect the data.

~~~
clintavo
Thank you for replying, and thank you for clarifying that. My question is:
since we've followed Google's guidelines for 10 years, never intentionally
engaged in any practices against Google's recommendations, and are already
doing the things outlined in the most recent Google blog post on this subject,
yet have, for some reason, lost all of our rankings completely and are being
outranked by known content farms and sites with adsense, what should we do?

I realize you can't guarantee any results, but honestly, we really have
absolutely no clue what to do next. Every change we've made to our site in the
past 2 years has been to try to do everything we read that Google wants from
official Google channels. We truly just don't know what else to do.

Edit to add: we did 3 months ago change from a very long domain name to a
short one (faso.com) because it is shorter and we own a federal trademark on
the word "FASO" and thought that Google wanted to place more emphasis on
brands - our rankings stayed the same and even improved....until Monday.

------
tristanperry
Just like the anti-content farm blog posts previously, I'm half hoping that
the internet community turns its attention to the problem of content scrapers
(in the hope that Google take more action against the problem). I am a fan of
Google - and the biggest source of traffic to my websites is Google search
traffic - although the issue of scrapers does seem to be growing (even despite
the attempted anti-scraper Google algorithm update earlier in the year).

A couple of days ago I did some searching online and found that a fair number
of websites had copied some of my articles in their entirety. And sadly, a lot
of these 'websites' were actually Google Blogger (Blogspot) blogs. And whilst
some of these copied articles weren't appearing in Google search (I guess
since the entire site contained copied/scraped content, thus giving them a
Google SERP penalty?), some of the copied articles were appearing in the
SERPs. And a couple of these websites even had Google AdSense on them.

So there was the crazy situation whereby my content had been stolen/scraped
illegally, and put on a Google Blogger blog with Google Ads on it, and (in
some cases) that blog then received traffic from Google Search. Hrmph.

In the interest of balance, I will point out that I filed Google DMCA requests
after finding these scraped articles, and Google did promptly reply (a non-
automated reply around 30 hours later, which is quick considering how many
DMCAs Google must get).

They only removed the individual blog post (and not the blogs overall, even
though they were clearly spam blogs), but nonetheless I am happy with Google's
quick response.

I just wish that content scraping weren't (in some cases) a profitable
endeavor...

------
moultano
The headline is factually inaccurate. It looks like a mistake that he was
denied access to Adwords due to original content, but that has no connection
to his ranking in search. It looks like the adwords representative may not
have access to fine-grained enough tools to assess the site accurately, which
is an organizational failing, but there's no bad intent there. In some sense
it reflects how disconnected search and ads are from each other that they're
using crude tools to assess original content.

Google cares a great deal about putting the original source of a piece of
content first. If we're doing that incorrectly, it's because we screwed up,
not because that's how things are designed. It's a hard problem and an area we
are still working on intensely. It would be great if someone involved could
post the queries on which we are screwing up so we can debug what's causing
it.

~~~
wooster
You don't seem to have read the article.

The search:

["a superb app for iPad and iPhone that lets you quickly and easily transfer
photos and videos between iOS devices and computers – has been updated this
week, to Version 2.3."]

returned results from content scrapers above the original content.

For me, the original content doesn't even show up in search results, even
though it's in Google's index:

[http://ipadinsight.com/ipad-apps/photo-transfer-app-
updateda...](http://ipadinsight.com/ipad-apps/photo-transfer-app-updatedadds-
ability-to-transfer-photos-and-videos-in-the-background)

Further, Google wouldn't let the site owner buy AdWords to drive traffic to
their site.

Google owns both search and AdWords. This makes the headline here:

    
    
      Google penalizes original content site because of scrapers
    

accurate, as far as I'm concerned.

~~~
moultano
The results for that query are horrible, but his complaint about losing
traffic certainly isn't due to his ranking on 30-word quoted queries. I'm
hoping to get an example of a normal query where he's losing out to scrapers
so we can debug what's going on.

The headline implies that he's penalized in search due to scrapers, which
isn't happening.

~~~
jcampbell1
> The headline implies that he's penalized in search due to scrapers, which
> isn't happening.

In this case the phrase "Google penalizes" = "Google denies access to
adwords". The word _penalize_ doesn't always refer to site penalties in the
Google search index.

~~~
moultano
The reason it's on top of HN is almost certainly because that's the conclusion
everyone jumped to. That seems to be how the word "penalize" is consistently
used with respect to Google.

~~~
jcampbell1
I agree, I am a bit surprised this is on the frontpage. What is ironic,
though, is that now when I search for the 30-word phrase, two scraper sites
appear with the article scraped from seobook.com. ...nicely backdated by 37
seconds.

------
staunch
Is Google doing anything to solve the content duplication problem?

It seems like a solvable problem. Why don't they let webmasters implement some
kind of time-based cryptographic signature?

It seems so lame that his problem has gone on so long, especially when there
must be some kind of technical solution.

For real businesses spending a few days implementing some authentication
protocol would not be particularly burdensome.
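A minimal sketch of what such a scheme might look like (purely hypothetical;
nothing like this exists on Google's side): the publisher holds a secret key
and signs a hash of each article together with its publish timestamp, so a
later copy either lacks a valid signature or carries a later timestamp.

```python
import hashlib
import hmac
import time

# Assumption: the publisher keeps this key private; a verifier would check
# signatures via a trusted endpoint the publisher controls.
SECRET_KEY = b"publisher-secret-key"

def sign_content(content: str, published_at: int) -> str:
    """Bind the content's hash to its publish time with an HMAC signature."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    message = f"{digest}:{published_at}".encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()

def verify(content: str, published_at: int, signature: str) -> bool:
    """True only if content, timestamp, and signature all match."""
    expected = sign_content(content, published_at)
    return hmac.compare_digest(expected, signature)

article = "My original iPad review..."
ts = int(time.time())
sig = sign_content(article, ts)

assert verify(article, ts, sig)                    # original checks out
assert not verify("altered scraped copy", ts, sig) # copies fail verification
```

A scraper could of course sign stolen content with its own key, which is
theseanstewart's objection below: the scheme only proves who signed first, and
only if both sites participate.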

~~~
theseanstewart
I'm guessing a lot of it has to do with determining where the content
originated. Google will crawl sites at different times of the day. If site B
stole content from site A but Google came to site B first, how are they
supposed to know that the content originated on site A? I'm sure there are
certain things they look at, like PR, trust signals, etc., to determine if the
content on site B could have been copied, but it can't possibly be perfect.
The time-based signature sounds good in theory, but implementing it across
billions of pages would be very difficult. Not only that, but what if site A
didn't create a signature but site B did when taking site A's content? I don't
think there's an easy solution to this problem.

~~~
melvinram
I don't know if it's an easy fix, but it's certainly not difficult to
eliminate 90% of scrapers. My sites get scraped all the time, and if you look
at these scraper sites, they usually are not scraping just one site. To
simplify the issue, let's look at a small data set:

    
    
      Site 1:
        * Content ABC
        * Content DEF
        * Content GHI
    
      Site 2:
        * Content JKL
        * Content MNO
        * Content PQR
        
      Site 3:
        * Content STU
        * Content VWX
        * Content YZ0
    
      Site 4:
        * Content ABC
        * Content DEF
        * Content MNO
        * Content PQR
        * Content STU
    

Which of these is a scraper?

~~~
Devilboy
The scraper is the one where the content appeared last.

~~~
lurker19
How do you perform that measurement using practically bounded computing and
networking resources?

------
rmason
Here's a simple idea that could fix a lot of this problem: copy Twitter's idea
of verified accounts.

Google could issue verified sites. If someone copied a verified site the
content would be automatically removed from the index.

Now they would have to hire some staffers to research the applications and
handle complaints. But this problem is beginning to cost them far more than a
few more staffers would.

~~~
zacharypinter
While that's certainly a potential fix, Google seems very averse to
non-algorithmic solutions (see their issues with customer support).

------
raganwald
I'm sure this is a _gross_ oversimplification, but Google is in the business
of monetizing people's interest in content it doesn't create. Who will it
perceive is the better partner for that monetization, the scraper who
understands how to apply Google's tools to maximize monetization, or the
original content author?

~~~
anonymous246
Umm, wasn't Google supposed to "Don't be evil" and care about their search
result quality above all else? I find your defense of and apathy toward
hypocrisy disturbing.

And oh, your post pushed a hot button of mine: comments by people who think
they're being incredibly insightful by saying "the world is not fair" in
different ways. On HN, I think we can assume that people are adults and
understand such things. Sorry about the flame.

~~~
antirez
"Umm, wasn't Google supposed to "Don't be evil" and care about their search
result quality above all else?"

Mmmm... no, unfortunately Google is supposed to make money, and they make
money mainly from adsense. I'm not prepared to accept that random algo
modifications having a big impact on their revenue are made without concern.

And when you are, at the same time, the company driving people to web sites
(search) and profiting from the ads shown on those pages (adsense), something
bad can happen, as it is not a free-market setup.

~~~
moultano
>I'm not prepared to accept that random algo modifications having a big impact
on their revenue are made without concern.

The effect on adsense revenue isn't even _measured_ let alone used. The people
who make decisions about changes to ranking algorithms don't even _see_ the
data on adsense revenue because we don't even have a system to _collect_ it
for ranking changes.

I work in search quality, and there are many metrics I have to collect to
launch changes. None of them involve ads.

~~~
antirez
I'm glad to hear that, and I trust you. If that is the case, I expect to find
fewer spam engines and less copied content in the coming months when I search
on Google :)

------
patrickj
I'm the guy whose site this is all about - iPadInsight.com. I only used that
specific search in the forum thread because I discovered that was the single
point on which the Adwords reviewer had judged that my site didn't produce
original content. I sent back the results of the same search showing that
links were either from legit aggregator sites (like alltop.com) linking back
to my original review, or from a number of scraper sites that rip my content.
Even when the review was overturned for Adwords I was told it would leave a
black mark against my site because it had already been marked that way. Great
system.

Soon after all that hassle, my site suddenly lost 60% of its traffic. From
what I can gather, mine is one of the quality sites that produce original
content that has been mistakenly penalized in the Panda / Farmer / Whichever
Other updates.

Among the reasons I say my site is a quality site that produces original
content, in accordance with this post at the Google Webmaster Central Blog
([http://googlewebmastercentral.blogspot.com/2011/05/more-
guid...](http://googlewebmastercentral.blogspot.com/2011/05/more-guidance-on-
building-high-quality.html)) and with all the logic I can apply to the
subject, are:

\-- The site contains over 1,700 posts published in the last 15 months. I
wrote around 1,550 of them myself. The remainder are written by three other
occasional authors, who are colleagues and friends of mine. There's no
'outsourcing' of content creation or anything of that ilk.

\-- I spend tons of hours every day researching and writing the content that
appears on my site. Every app review on the site is 100% original content
(<http://ipadinsight.com/category/ipad-app-reviews>), as are all posts
published.

\-- I do consider myself an expert on the subject my site covers - the iPad. I
have been writing app reviews, accessory reviews, tips, and how-to posts on it ever
since it launched. I've appeared on ABC World News and numerous radio programs
as an iPad and Apple expert. I've been a contributing author for iPhone and
iPad Life magazine (printed publication) since their debut issue - writing
expert tips and tricks posts, buyer's guide articles, and more. I'm listed in
Robert Scoble's Twitter list of best tech people to follow. Blue-chip app
publishers and accessory vendors approach me to write about their products.
The Daily (the first iPad only newspaper) contacted me before their app even
hit the App Store, as do many leading publishers. I've been a beta tester for
many top iOS apps for years. I participate regularly at several leading iPad
and iOS forums. I'm not saying any of this to boast, but in an effort to
establish that I'm a blogger who is enormously passionate about the subject I
cover, and someone who is respected in the area (mobile tech) that I write on.

\-- My site is a long-standing member of the Got-OATS group of sites
(<http://www.gotoats.org/>) that seek to uphold and promote the highest ethics
in app reviews. We never accept money for reviews or coverage, and add
disclosure statements to our reviews to indicate whether we received a promo
code for an app reviewed, or a sample unit of an accessory reviewed.

\-- I spend a lot of time on every single post, on researching, on testing
apps and whatever else I'm covering, on ensuring that spelling and grammar are
spot-on, on providing good screencaps of apps in action, and every other
detail I can think of.

\-- I use a great caching plugin on my site and do my best, with help from a
few WordPress experts, to keep the site fast and clean.

\-- I currently have close to 4,500 RSS subscribers and over 3,000 Twitter
followers for the site's account.

\-- Before my recent sudden traffic fall off a cliff due to Panda, my site had
around 80-100,000 unique visitors per month.

As for search results and scraper sites, I am still often seeing horrendous
spam sites ranking above me for recent posts. Here is just one quick example
on a recent post I wrote about iPad rivals, where several scraper sites rank
above mine, including one (ipads101.com) which I have submitted 3 spam reports
on via Google Webmaster over the last two months, and had zero response:

[http://www.google.com/search?sourceid=chrome&ie=UTF-8...](http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=ipad+rivals+the+year+of+the+clueless)

I run a good site. I pour hours of effort and my heart and soul into it. And I
think it has been very wrongly assessed by whichever new algorithm.

~~~
moultano
I get your site #1 for that search. I suspect the scrapers are ranking above
your site only when you search, due to personalization (you've clicked on them
in the past). Try adding &pws=0 to your search to see if they still rank
higher, and I'll debug further.

This is a quite common problem actually when webmasters have reason to hate
certain sites. They click on those sites in search results a lot and often see
them promoted above their own, even though they're the only one who sees them
promoted.

~~~
js2
Searching for "ipad rivals the year of the clueless", with or without &pws=0,
I get 5 scraper sites before ipadinsight.com/category/ipad-rivals-2:

1\. usedipadforsale.net/ipad-rivals-this-year-the-year-of-the-copycats-or-the-
clueless.htm

2\. www.ipads101.com/ipad-rivals-this-year-the-year-of-the-copycats-or-the-
clueless/

3\. ipads2nd.com/.../ipad-rivals-this-year-the-year-of-the-copycats-or-the-
clueless/

4\. catsmakemebats.micasaessucasagermania.com/.../ipad-rivals-this-year-the-
year-of-the-copycats-or-the-clueless/

5\. ipad.thedailyglobe.com/.../ipad-rivals-this-year-the-year-of-the-copycats-
or- the-clueless/

6\. ipadinsight.com/category/ipad-rivals-2

That looks pretty bad to me.

~~~
pbz
For that query I get ipadinsight #1 in bing. The only other result from that
list that shows up is the "usedipadforsale.net" as #5.

------
blauwbilgorgel
I believe Panda also looks at originality, content freshness, document
authority, trust factors, usability factors, site authority, etc.

I'd agree that the denial of AdWords was (obviously?) wrong, if this is all
there is to the picture. As for looking at iPad information on the internet,
after a manual inspection of that site, I, as a user that cares for quality
and relevance, don't need any of the results on justanotheripadblog.com in my
top 100.

The order of relevance, discovery and editorial quality seems to flow from:

<http://reviews.cnet.com/8301-19512_7-20023976-233.html>

>

[http://ipadinsight.com/ipad-tips-tricks/how-to-make-
airprint...](http://ipadinsight.com/ipad-tips-tricks/how-to-make-airprint-
work-with-just-about-any-printer/)

>

[http://www.info4arab.com/how-to-make-airprint-work-with-
just...](http://www.info4arab.com/how-to-make-airprint-work-with-just-about-
any-printer/)

With a lot of intermediate steps.

iPadInsight.com is not a cheap scraper site, but is it a site that does
original research, beyond rehashing what is hot in the industry? I think Panda
might have judged correctly in not assigning higher rankings to this site.

The site seems to have had a canonical problem with the comments in 2010,
inflating the site's size in the index roughly tenfold. The depth of these comments is
usually not much more than: "Great! Interesting Article! Love this! Thank
you!" and might just as well have been auto-generated.

Also, the ShareASale footer link "Thesis Theme for WordPress" alone might
disqualify you from running Adsense, as you dofollow an affiliate link (which
is not allowed in the webmaster quality guidelines).

The trademark inside domain name might be another issue.

------
Alex3917
The same thing happened to us as well. When I did an event last month our page
rank dropped from 4 to 2 despite getting 12+ new links, which I strongly
suspect is because all of the bloggers who link to our conference site are
having their blogs duplicated by content farms. Overall our page rank has
dropped from 7 to 2 in a little over a year, despite having 10x more inbound
links. (And zero SEO or anything else that would violate Google's best
practices.)

I did ask a Google employee, who said it was because we weren't using
canonical tags, but this doesn't make much sense and fixing this doesn't seem
to have done anything to improve the situation.
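For anyone unfamiliar, rel=canonical is a one-line hint in a page's head that
names the authoritative URL for a piece of content, so duplicate URLs get
consolidated. A trivial sketch of emitting it (the helper name and URL are
illustrative, not from any real site):

```python
def canonical_tag(post_url: str) -> str:
    """Build the <link> element naming the authoritative URL for a page."""
    return f'<link rel="canonical" href="{post_url}" />'

# Example: every variant of a post's URL (pagination, comment anchors, etc.)
# would emit the same canonical URL in its <head>.
print(canonical_tag("http://example.com/ipad-rivals"))
# <link rel="canonical" href="http://example.com/ipad-rivals" />
```

It mainly deduplicates URLs on your own domain, which is why it wouldn't be
expected to help against third-party scrapers.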

~~~
jedc
The article above is about his problems getting an AdWords campaign approved,
not about his ranking in search results.

~~~
MichaelApproved
Those problems are one and the same. He can't get an AdWords campaign approved
for the same reason his search results rank poorly: he's being scraped.
~~~
jedc
In the original post, he used a very specific query and his site was listed
2nd. Naturally the scrapers were also all over that results page, since they
were all exact matches to the text. I didn't see anything about him
complaining about poor search ranking:
([http://www.google.com/support/forum/p/AdWords/thread?tid=0bb...](http://www.google.com/support/forum/p/AdWords/thread?tid=0bb4bb671eff7054&hl=en))

------
antimatter15
I don't understand the incentives for Google to deny AdWords to someone.
AdWords is what ultimately gives Google its profits, not AdSense. Google
already has plenty of places for you to see ads, and AdWords is the product
that actually takes money acquired through other industries and funnels it
into Google.

I really don't think it's being done out of malicious intent. I think it's
very likely that it's just being done because of negligence, since service/app
reviews happen to be frequently scraped.

~~~
chalst
The complaint is indeed about negligence, not malice.

Google puts some effort into making sure that (i) people aren't annoyed by ads
directing them to content they didn't want, and (ii) the top ads are good
quality and likely to have a good click-through rate. Ads that have a high
quality score cost less to place for that reason, ads with a low quality score
don't get placed, or get placed on the second page of ads.
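A toy illustration of that auction logic (the numbers and the simple
bid-times-quality formula are a simplification of how AdWords actually ranks
ads):

```python
# Two hypothetical advertisers: ad_a bids less but has a high quality score,
# ad_b bids more with a low quality score.
ads = [
    {"name": "ad_a", "bid": 2.00, "quality": 9},
    {"name": "ad_b", "bid": 4.00, "quality": 3},
]

# Rank each ad by bid * quality score, highest first.
for ad in ads:
    ad["rank"] = ad["bid"] * ad["quality"]
ads.sort(key=lambda a: a["rank"], reverse=True)

print([a["name"] for a in ads])  # ['ad_a', 'ad_b']
```

The higher-quality ad wins the top slot despite bidding half as much, which is
the mechanism chalst describes: quality lets you pay less for the same
position, and a sufficiently low quality score keeps an ad out entirely.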

------
pixcavator
To hell with Facebook’s replacements; it’s Google that needs a replacement!
(Note: I am currently thinking about possible alternatives to PageRank.)

