
I'm guessing a lot of it has to do with determining where the content originated. Google crawls sites at different times of the day. If site B stole content from site A but Google came to site B first, how are they supposed to know that the content originated on site A? I'm sure there are certain signals they look at, like PR, trust, etc., to determine whether the content on site B could have been copied, but it can't possibly be perfect. The time-based signature sounds good in theory, but implementing it across billions of pages would be very difficult. Not only that, but what if site A didn't create a signature but site B did when taking site A's content? I don't think there's an easy solution to this problem.
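For concreteness, here is a minimal sketch of what a time-based signature could look like: the publisher hashes the content, binds the hash to a timestamp, and signs the pair so it can later claim "I had this text at time T." The key handling and flow are assumptions for illustration, not any real Google mechanism.

  import hashlib, hmac, json, time

  # Hypothetical sketch: hash the content, bind the hash to a timestamp, and
  # sign the pair. Keys and flow are assumptions, not a real Google mechanism.
  SECRET_KEY = b"publisher-signing-secret"  # a real scheme would use an asymmetric key pair

  def make_time_signature(content: str) -> dict:
      digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
      timestamp = int(time.time())
      payload = f"{digest}:{timestamp}".encode("utf-8")
      signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
      return {"content_sha256": digest, "timestamp": timestamp, "signature": signature}

  def verify_time_signature(content: str, record: dict) -> bool:
      digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
      payload = f"{digest}:{record['timestamp']}".encode("utf-8")
      expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
      return digest == record["content_sha256"] and hmac.compare_digest(expected, record["signature"])

  print(json.dumps(make_time_signature("My original article text"), indent=2))

The last point in the comment above still applies: if the original author never creates such a record but the scraper does, the signature proves nothing about who wrote the text first.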


I don't know if it's an easy fix, but it's certainly not difficult to eliminate 90% of scrapers. My sites get scraped all the time, and if you look at these scraper sites, they usually aren't scraping just one site. To simplify the issue, let's look at a small data set:

  Site 1:
    * Content ABC
    * Content DEF
    * Content GHI

  Site 2:
    * Content JKL
    * Content MNO
    * Content PQR
    
  Site 3:
    * Content STU
    * Content VWX
    * Content YZ0

  Site 4:
    * Content ABC
    * Content DEF
    * Content MNO
    * Content PQR
    * Content STU
Which of these is a scraper?


When you add site 5 into the mix, which has:

  Site 5:
    * Content ABC
    * Content DEF

it becomes much harder to identify.
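As a rough sketch of the overlap heuristic described above (site names, thresholds, and content fingerprints are made up for illustration):

  # Overlap heuristic from the example above: a site that reproduces content
  # from several *different* sites looks like a scraper. Data and thresholds
  # are made up for illustration.
  from collections import defaultdict

  sites = {
      "site1": {"ABC", "DEF", "GHI"},
      "site2": {"JKL", "MNO", "PQR"},
      "site3": {"STU", "VWX", "YZ0"},
      "site4": {"ABC", "DEF", "MNO", "PQR", "STU"},
      "site5": {"ABC", "DEF"},
  }

  # Map each piece of content to the sites hosting it.
  hosts = defaultdict(set)
  for site, items in sites.items():
      for item in items:
          hosts[item].add(site)

  # Flag sites whose content is mostly duplicated and spread across several sources.
  for site, items in sites.items():
      others = {h for item in items for h in hosts[item]} - {site}
      duplicated = sum(1 for item in items if len(hosts[item]) > 1)
      if len(others) >= 2 and duplicated / len(items) > 0.8:
          print(f"{site} looks like a scraper (overlaps with {len(others)} other sites)")

With this naive threshold, site 5 gets flagged too, which is exactly the ambiguity the comment points out: a small site that shares a couple of items looks the same as a scraper under pure overlap counting.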

If you are a blackhat SEO, then you keep track of the last time Google indexed you (including the anonymous crawlers, which is tricky), and backdate scraped content to just ahead of the time you were last indexed. Then you can send a complaint through Google's tools about the site that wrote the original content.

Blackhats make content duplication a really challenging problem, and having a complaint form isn't going to solve much. The blackhats can take advantage of that as well.
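Knowing when Googlebot last crawled a page, the tracking step mentioned above, comes down to scanning access logs. A minimal sketch, assuming a standard Apache/Nginx "combined" log format (the file name is a placeholder):

  # Find the most recent Googlebot visit per path in a standard access log.
  # The log path and format (Apache/Nginx "combined") are assumptions.
  import re

  LOG_LINE = re.compile(
      r'\S+ \S+ \S+ \[(?P<ts>[^\]]+)\] "(?:GET|HEAD) (?P<path>\S+)[^"]*" '
      r'\d+ \S+ "[^"]*" "(?P<ua>[^"]*)"'
  )

  last_crawl = {}
  with open("access.log") as log:  # placeholder file name
      for line in log:
          m = LOG_LINE.match(line)
          if m and "Googlebot" in m.group("ua"):
              # Log lines are in time order, so the last match per path wins.
              last_crawl[m.group("path")] = m.group("ts")

  for path, ts in sorted(last_crawl.items()):
      print(f"{path}  last crawled {ts}")

As the comment notes, this only catches crawlers that identify themselves in the user agent; the anonymous ones are the tricky part.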


Google does eliminate more than 90% of scrapers. It's just really easy to create scrapers; they can outnumber the original content by more than 1000-1. So you need many, many systems to remove scrapers.


The scraper is the one where the content appeared last.
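The most literal reading of that rule is a "first seen wins" index keyed by a content fingerprint. A minimal in-memory sketch (the crawl feed and exact-hash fingerprint are simplifying assumptions):

  # Naive "first seen wins" index: the first site observed with a given content
  # fingerprint is treated as the original; later sites are duplicates.
  import hashlib
  import time

  first_seen = {}  # fingerprint -> (site, first crawl timestamp)

  def fingerprint(text: str) -> str:
      # A real system would use shingling or simhash to catch near-duplicates.
      return hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()

  def observe(site: str, text: str) -> str:
      fp = fingerprint(text)
      if fp not in first_seen:
          first_seen[fp] = (site, time.time())
          return "original"
      original_site, _ = first_seen[fp]
      return "original" if site == original_site else "duplicate"

  print(observe("siteA.example", "Some article text"))  # original
  print(observe("siteB.example", "Some article text"))  # duplicate

Even this toy version exposes the catch raised at the top of the thread: "appeared last" really means "was crawled last", so the verdict depends on crawl order.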


How do you perform that measurement using practically bounded computing and networking resources?


It seems to me that there should be some way for Google to see the original creator's content before it goes live to the internet; some sort of pre-publishing protocol. Essentially tell the Googlebot, "This is the content I am planning to publish; have a look so you will know who created it first."
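A sketch of what such a pre-publishing handshake might look like from the publisher's side: register a digest of the draft with a registry endpoint before the post goes live. The URL, payload, and receipt format here are entirely hypothetical; no such Google API exists in this form.

  # Hypothetical pre-publishing handshake: register a digest of the draft with
  # a registry endpoint before the post goes live. URL, payload, and receipt
  # format are made up; this is not a real Google API.
  import hashlib
  import json
  import urllib.request

  REGISTRY_URL = "https://example.com/prepublish/register"  # placeholder

  def preregister(draft_html: str, site: str) -> dict:
      payload = json.dumps({
          "site": site,
          "content_sha256": hashlib.sha256(draft_html.encode("utf-8")).hexdigest(),
      }).encode("utf-8")
      req = urllib.request.Request(
          REGISTRY_URL,
          data=payload,
          headers={"Content-Type": "application/json"},
          method="POST",
      )
      with urllib.request.urlopen(req) as resp:
          # The receipt would record when the digest was registered.
          return json.load(resp)

Whether original authors would actually bother to use it is another matter, as the next reply points out.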


(I used to work on Google search quality.) In general, the problem with such systems is that scraper-writers use them much more diligently than original-content-writers.


Has this been integrated into Blogspot?

The trick would be getting the feature hooked into the major publishing tools.


A human can figure it out. That's the real problem. You have a guy send in a support ticket / request, and then you pawn them off on someone who really cannot do anything about it. It really seems like support is just there to deflect rather than solve. Heck, look at the actual forum thread the article links to http://www.google.com/support/forum/p/AdWords/thread?tid=0bb...

It just seems like a purely algorithmic solution doesn't really scale and some human intervention is really necessary.


Humans don't scale either. That's why Google doesn't use humans.

Sounds like the easy solution here is to have people submit their content to some Google authentication system at the same time they submit to their blog. Problem solved.


They don't scale, but Google could afford to hire enough people to make the problem go away if they made it a focus. That's not going to happen any time soon, though, when they don't believe in customer service.


It doesn't need to be done across billions of pages. As soon as they detect duplicates, there is a high chance it will happen again, so monitoring the two or more sites with the same content is doable, if not very efficient. Ideally, the author being scraped should be able to point Google to their page to prove it came first. The scraper won't be able to do that.



