
I'm guessing a lot of it has to do with determining where the content originated. Google crawls sites at different times of the day. If site B stole content from site A but Google came to site B first, how are they supposed to know that the content originated on site A? I'm sure there are certain signals they look at, like PR, trust, etc., to determine whether the content on site B could have been copied, but it can't possibly be perfect. The time-based signature sounds good in theory, but implementing it across billions of pages would be very difficult. Not only that, but what if site A didn't create a signature but site B did when taking site A's content? I don't think there's an easy solution to this problem.
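For concreteness, here is a minimal sketch of what a time-based signature could look like: the publisher hashes the content, binds the hash to a timestamp, and signs the pair so it can later claim "I had this text at time T." The key handling and flow are assumptions for illustration, not any real Google mechanism.

  import hashlib, hmac, json, time

  # Hypothetical sketch: hash the content, bind the hash to a timestamp, and
  # sign the pair. Keys and flow are assumptions, not a real Google mechanism.
  SECRET_KEY = b"publisher-signing-secret"  # a real scheme would use an asymmetric key pair

  def make_time_signature(content: str) -> dict:
      digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
      timestamp = int(time.time())
      payload = f"{digest}:{timestamp}".encode("utf-8")
      signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
      return {"content_sha256": digest, "timestamp": timestamp, "signature": signature}

  def verify_time_signature(content: str, record: dict) -> bool:
      digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
      payload = f"{digest}:{record['timestamp']}".encode("utf-8")
      expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
      return digest == record["content_sha256"] and hmac.compare_digest(expected, record["signature"])

  print(json.dumps(make_time_signature("My original article text"), indent=2))

The last point in the comment above still applies: if the original author never creates such a record but the scraper does, the signature proves nothing about who wrote the text first.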


I don't know if it's an easy fix, but it's certainly not difficult to eliminate 90% of scrapers. My sites get scraped all the time, and if you look at these scraper sites, they usually aren't scraping just one site. To simplify the issue, let's look at a small data set:

  Site 1:
    * Content ABC
    * Content DEF
    * Content GHI

  Site 2:
    * Content JKL
    * Content MNO
    * Content PQR
    
  Site 3:
    * Content STU
    * Content VWX
    * Content YZ0

  Site 4:
    * Content ABC
    * Content DEF
    * Content MNO
    * Content PQR
    * Content STU
Which of these is a scraper?


When you add site 5 into the mix, which has:

  Site 5:
    * Content ABC
    * Content DEF

it becomes much harder to identify.
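As a rough sketch of the overlap heuristic described above (site names, thresholds, and content fingerprints are made up for illustration):

  # Overlap heuristic from the example above: a site that reproduces content
  # from several *different* sites looks like a scraper. Data and thresholds
  # are made up for illustration.
  from collections import defaultdict

  sites = {
      "site1": {"ABC", "DEF", "GHI"},
      "site2": {"JKL", "MNO", "PQR"},
      "site3": {"STU", "VWX", "YZ0"},
      "site4": {"ABC", "DEF", "MNO", "PQR", "STU"},
      "site5": {"ABC", "DEF"},
  }

  # Map each piece of content to the sites hosting it.
  hosts = defaultdict(set)
  for site, items in sites.items():
      for item in items:
          hosts[item].add(site)

  # Flag sites whose content is mostly duplicated and spread across several sources.
  for site, items in sites.items():
      others = {h for item in items for h in hosts[item]} - {site}
      duplicated = sum(1 for item in items if len(hosts[item]) > 1)
      if len(others) >= 2 and duplicated / len(items) > 0.8:
          print(f"{site} looks like a scraper (overlaps with {len(others)} other sites)")

With this naive threshold, site 5 gets flagged too, which is exactly the ambiguity the comment points out: a small site that shares a couple of items looks the same as a scraper under pure overlap counting.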

If you are a blackhat SEO, then you keep track of the last time Google indexed you (including the anonymous crawlers, which is tricky), and backdate scraped content to just ahead of the time you were last indexed. Then you can send a complaint through Google's tools about the site that wrote the original content.

Blackhats make content duplication a really challenging problem, and having a complaint form isn't going to solve much. The blackhats can take advantage of that as well.
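Knowing when Googlebot last crawled a page, the tracking step mentioned above, comes down to scanning access logs. A minimal sketch, assuming a standard Apache/Nginx "combined" log format (the file name is a placeholder):

  # Find the most recent Googlebot visit per path in a standard access log.
  # The log path and format (Apache/Nginx "combined") are assumptions.
  import re

  LOG_LINE = re.compile(
      r'\S+ \S+ \S+ \[(?P<ts>[^\]]+)\] "(?:GET|HEAD) (?P<path>\S+)[^"]*" '
      r'\d+ \S+ "[^"]*" "(?P<ua>[^"]*)"'
  )

  last_crawl = {}
  with open("access.log") as log:  # placeholder file name
      for line in log:
          m = LOG_LINE.match(line)
          if m and "Googlebot" in m.group("ua"):
              # Log lines are in time order, so the last match per path wins.
              last_crawl[m.group("path")] = m.group("ts")

  for path, ts in sorted(last_crawl.items()):
      print(f"{path}  last crawled {ts}")

As the comment notes, this only catches crawlers that identify themselves in the user agent; the anonymous ones are the tricky part.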


Google does eliminate more than 90% of scrapers. It's just really easy to create scrapers; they can outnumber the original content by more than 1000-1. So you need many, many systems to remove scrapers.


The scraper is the one where the content appeared last.
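The most literal reading of that rule is a "first seen wins" index keyed by a content fingerprint. A minimal in-memory sketch (the crawl feed and exact-hash fingerprint are simplifying assumptions):

  # Naive "first seen wins" index: the first site observed with a given content
  # fingerprint is treated as the original; later sites are duplicates.
  import hashlib
  import time

  first_seen = {}  # fingerprint -> (site, first crawl timestamp)

  def fingerprint(text: str) -> str:
      # A real system would use shingling or simhash to catch near-duplicates.
      return hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()

  def observe(site: str, text: str) -> str:
      fp = fingerprint(text)
      if fp not in first_seen:
          first_seen[fp] = (site, time.time())
          return "original"
      original_site, _ = first_seen[fp]
      return "original" if site == original_site else "duplicate"

  print(observe("siteA.example", "Some article text"))  # original
  print(observe("siteB.example", "Some article text"))  # duplicate

Even this toy version exposes the catch raised at the top of the thread: "appeared last" really means "was crawled last", so the verdict depends on crawl order.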


How do you perform that measurement using practically bounded computing and networking resources?


It seems to me that there should be some way for Google to see the original creator's content before it goes live to the internet; some sort of pre-publishing protocol. Essentially tell the Googlebot, "This is the content I am planning to publish; have a look so you will know who created it first."
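A sketch of what such a pre-publishing handshake might look like from the publisher's side: register a digest of the draft with a registry endpoint before the post goes live. The URL, payload, and receipt format here are entirely hypothetical; no such Google API exists in this form.

  # Hypothetical pre-publishing handshake: register a digest of the draft with
  # a registry endpoint before the post goes live. URL, payload, and receipt
  # format are made up; this is not a real Google API.
  import hashlib
  import json
  import urllib.request

  REGISTRY_URL = "https://example.com/prepublish/register"  # placeholder

  def preregister(draft_html: str, site: str) -> dict:
      payload = json.dumps({
          "site": site,
          "content_sha256": hashlib.sha256(draft_html.encode("utf-8")).hexdigest(),
      }).encode("utf-8")
      req = urllib.request.Request(
          REGISTRY_URL,
          data=payload,
          headers={"Content-Type": "application/json"},
          method="POST",
      )
      with urllib.request.urlopen(req) as resp:
          # The receipt would record when the digest was registered.
          return json.load(resp)

Whether original authors would actually bother to use it is another matter, as the next reply points out.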


(I used to work on Google search quality.) In general, the problem with such systems is that scraper-writers use them much more diligently than original-content-writers.


Has this been integrated into Blogspot?

The trick would be getting the feature hooked into the major publishing tools.


A human can figure it out. That's the real problem. You have a guy send in a support ticket / request, and then you pawn them off on someone who really cannot do anything about it. It really seems like support is just there to deflect rather than solve. Heck, look at the actual forum thread the article links to http://www.google.com/support/forum/p/AdWords/thread?tid=0bb...

It just seems like a purely algorithmic solution doesn't really scale and some human intervention is really necessary.


Humans don't scale either. That's why Google doesn't use humans.

Sounds like the easy solution here is to have people submit their content to some Google authentication system at the same time they submit to their blog. Problem solved.


They don't scale, but Google could afford to hire enough people to make the problem go away if they made it a focus. That's not going to happen any time soon, though, when they don't believe in customer service.


It doesn't need to be done across billions of pages. As soon as they detect duplicates, there is a high chance it will happen again, so monitoring the two or more sites with the same content is doable, if not very efficient. Ideally, the author being scraped should be able to point Google to their page to prove it came first. The scraper won't be able to do that.



