

The Ultimate Guide To Duplicate Content - davidwhitehouse
http://www.david-whitehouse.org/blog/duplicate-content/

======
teyc
Interesting that you mention it is 8 words in a row. The usual way duplicate
content is checked for is with a rolling hash. I put one up online a few
years ago.
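As a rough illustration of that approach (my sketch, not the commenter's actual tool): hash every 8-word window ("shingle") of a page and compare the sets of hashes. A production version would use a true rolling hash such as Rabin-Karp so each window isn't rehashed from scratch.

```python
import hashlib

def shingle_hashes(text, window=8):
    """Hash every run of `window` consecutive words in `text`."""
    words = text.lower().split()
    return {
        hashlib.md5(" ".join(words[i:i + window]).encode()).hexdigest()
        for i in range(len(words) - window + 1)
    }

def overlap(a, b, window=8):
    """Fraction of a's word windows whose hash also appears in b."""
    ha, hb = shingle_hashes(a, window), shingle_hashes(b, window)
    return len(ha & hb) / len(ha) if ha else 0.0
```

Two pages sharing many 8-word windows score near 1.0; unrelated pages score near 0.0.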

Other content should be checked as well: printable versions should not be
indexed, and WordPress also generates crawlable URLs that return the same
content, e.g. archives and categories.

The presence of session ids in URLs used to be a big problem, but search
engines seem to have matured and become a bit cleverer. However, if you aren't
using a well-known CMS, it is better to make sure you link to a canonical
version and keep the one with the session id out of the search engine index.
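A minimal sketch of that kind of cleanup: strip session-id query parameters before generating the links you want indexed. The parameter names below (PHPSESSID, sid, etc.) are common examples, not a definitive list.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed examples of session-id parameter names; adjust for your stack.
SESSION_PARAMS = {"phpsessid", "sid", "sessionid", "jsessionid"}

def canonical_url(url):
    """Return `url` with session-id parameters and the fragment removed."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k.lower() not in SESSION_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), ""))
```

The same cleaned URL can also be emitted in a `<link rel="canonical" href="...">` tag so search engines know which version to index.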

------
seriocomic
Agree with @bauchidgw - <http://news.ycombinator.com/item?id=2417177> -
there are a few missing (camel-case URLs, capitalized URLs, folder
re-ordering, rewritten-but-not-redirected URLs, root vs index.xxx, etc.). Also
missing (considering this is meant to be the 'Ultimate Guide') is a
comprehensive definition of duplicate content, something like "appreciably
similar content between one uniquely accessible URI and another", along with
canonicalization and more.

But a good start nonetheless for those who keep tripping up on this
simple-to-fix issue.

------
eli
"www vs non-www"

I've heard this before, but I have a hard time believing it. Google really
can't figure out on its own that for some sites the www is optional?

~~~
davidwhitehouse
I don't think they can, unless you submit both to Google Webmaster Tools and
then select a preferred domain to show.

Besides, why bother risking it when it's a 5 minute job?
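The "5 minute job" is usually a site-wide 301 redirect from the non-canonical host to the canonical one. A rough sketch as WSGI middleware, with `www.example.com` as a stand-in canonical host:

```python
CANONICAL_HOST = "www.example.com"  # assumed example value

def force_canonical_host(app):
    """WSGI middleware: 301-redirect any other Host to CANONICAL_HOST."""
    def middleware(environ, start_response):
        host = environ.get("HTTP_HOST", "")
        if host and host != CANONICAL_HOST:
            # A fuller version would also carry over QUERY_STRING.
            location = "http://%s%s" % (CANONICAL_HOST,
                                        environ.get("PATH_INFO", "/"))
            start_response("301 Moved Permanently",
                           [("Location", location)])
            return [b""]
        return app(environ, start_response)
    return middleware
```

In practice most people do the equivalent in their web server's config (a rewrite rule) rather than in application code; the effect is the same: one host gets indexed.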

~~~
benologist
Because the internet at large is not going to do those 5 minutes of work -
just like with web standards, it's up to browsers to support the decade of
HTML that was produced before anyone gave a crap about the W3C.

~~~
a5seo
Googlebot is not a browser. It's a distinct "user" of your site with special
needs.

Think of Googlebot as one slow-witted user of your site. If 67% of your users
relied on the opinion of that single user, you'd damn sure do 5 minutes'
worth of work to help him along.

And if you didn't, it would be your own fault for failing to get his reference
(read: ranking).

~~~
benologist
Google's indexing problems remain Google's problems though - the internet at
large has demonstrated again and again that large portions are not going to
update, whether it's Flash video, malformed HTML, canonical issues, duplicate
content issues, dependencies on JavaScript, or worse, on specific versions of
specific browsers, etc.

Just as browser vendors fix bad HTML themselves in their rendering engines,
this is something Google has to fix itself, because it's one fix for the whole
internet instead of 11 billion fixes delegated to people who just aren't going
to make them.

It's a shitty job but they volunteered for it. And make billions doing it.

I am sure there are massive sections of the internet that are never going to
get more than incidental traffic from Google - all the sites that just don't
know/care about SEO, not to mention all the ones that just can't outdo the
adsense and affiliate spammers and content farms.

------
bauchidgw
missed a few:

mixed-case URLs (on a Windows server), trailing slash vs no trailing slash,
double slashes in the URL, too much boilerplate HTML (nav, footer, right-hand
sidebar) on pages with otherwise minor content, indexed IP-address URLs,
additional URL parameters (i.e. tracking parameters), ... and these are just
off the top of my head
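Several of the URL variants in that list can be collapsed with a normalization pass before linking or indexing. A hedged sketch (which rules are safe to apply is site-specific; lowercasing the path, for instance, only makes sense on a case-insensitive server, so this sketch leaves path case alone):

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize(url):
    """Collapse common duplicate-URL variants into one canonical form."""
    parts = urlsplit(url)
    # //a//b -> /a/b
    path = re.sub(r"/{2,}", "/", parts.path)
    # /a/ -> /a (but keep the bare root "/")
    if len(path) > 1 and path.endswith("/"):
        path = path.rstrip("/")
    # Drop utm_* tracking parameters, keep everything else.
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query)
                       if not k.lower().startswith("utm_")])
    # Hostnames are case-insensitive, so lowercasing the host is always safe.
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, query, ""))
```
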

~~~
davidwhitehouse
I got the IP-address URLs - those are usually staging servers. I was also
going to mention tracking parameters, with Google Analytics as an example, but
it really isn't that common.

Definitely should have added the slash vs no slash one though; will add it
later. Feel free to add them yourself if you like ;)

