
But detecting duplicate content should not be very difficult, esp. now that Google indexes everything almost in real time. The site that had the content first is necessarily canonical and the others are the copies?

Because we don't understand what's hard, we think you're not really trying, and then we make up evil reasons to explain that.

I believe that if people understood the difficulties of spam fighting better, they would be more understanding.




> But detecting duplicate content should not be very difficult, esp. now that Google indexes everything almost in real time. The site that had the content first is necessarily canonical and the others are the copies?

Not necessarily. The rate at which Google refreshes its crawl of a site, and how deep it crawls, depend on how often the site updates and on its PageRank. If a scraper site updates more often and has higher PR than the sites it's scraping, Google will be more likely to find the content there than at its source. Identifying the scraper copy as canonical because it was encountered first would be wrong.
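To make that concrete, here's a toy model (nothing like Google's real scheduler; the site names and intervals are made up) showing why "first seen" can differ from "original": if the crawler revisits each site on its own interval, a fast-recrawled scraper is encountered before a slowly recrawled source.

```python
def first_encountered(recrawl_intervals):
    """recrawl_intervals: dict mapping site name -> hours between
    recrawls. After a page appears on all sites, the crawler next
    reaches whichever site has the smallest interval, so that site
    is where the content is first *seen* -- not where it originated."""
    return min(recrawl_intervals, key=recrawl_intervals.get)

# Hypothetical sites: the original blog is recrawled every 72 hours,
# while a frequently updated, high-PR scraper is recrawled every 2.
sites = {"original-blog.example": 72, "scraper.example": 2}
print(first_encountered(sites))  # -> scraper.example
```

Under this model, a heuristic of "the site crawled first is canonical" would hand the canonical slot to the scraper.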

