

Search Engines: Specify your canonical URL - ggrot
http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html

======
patio11
Regarding the question of whether Google is a monopoly or not: non-monopolies
cannot easily cause a new Internet standard to spring into being simply by
announcing that a program of theirs will now apply specified behavior to a
previously undefined syntactical element.

This line has a whole lot of chutzpah:

"This _standard_ can be adopted by any search engine when crawling and
indexing your site."

[Edit: Incidentally, I will have this implemented on my site by the end of the
day. Because I'd be an idiot not to. Google is, I think, probably the only
company who can create "drop what you are doing, now, this is your new
priority" work for me besides my _actual employer_.]

~~~
litewulf
(Agree with most of your points)

Yahoo and Google have both posted things before to the effect of "web authors:
it'd really help us out if you did X", and often the other will adopt the
convention. The spec only describes some uses of link elements, for example,
so this doesn't really seem like an abuse of anything so much as a case of
saying that something which was previously undefined now means something to
you (or them).

------
briansmith
You are much better off doing the following: (1) all responses for
non-canonical URLs are 301 redirects to the canonical URLs, (2) your website
never links to a resource using a URL other than its canonical one, and (3)
you encourage people to link to pages on your site using the canonical URLs.

This way your site will be very cache-friendly while still being usable. Also,
all search engines will be able to understand your site without any
proprietary extensions (a.k.a. "standards" at Google, apparently) being
needed.
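
A minimal sketch of point (1) as a bare WSGI app in Python. The `ALIASES` table and its paths are hypothetical, made up for illustration; a real site would derive the canonical form from its own routing rules rather than a hard-coded dict.

```python
# Hypothetical alias table: non-canonical path -> canonical path.
ALIASES = {
    "/index.html": "/",
    "/Home": "/",
    "/products/": "/products",
}


def app(environ, start_response):
    """Answer non-canonical paths with a 301, everything else with a 200."""
    path = environ.get("PATH_INFO", "/")
    canonical = ALIASES.get(path)
    if canonical is not None:
        # Permanent redirect: caches and crawlers can remember the mapping.
        start_response("301 Moved Permanently", [("Location", canonical)])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [("canonical page for %s\n" % path).encode("utf-8")]
```

Only the canonical path ever returns a body, so there is exactly one cacheable copy of each resource.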

~~~
wmf
I generally agree with that approach, but Google gives an example of pages
with query strings that need to have different URLs because they are subtly
different, but not in a way that search engines need to care about.

OTOH, Google's wiki example is bogus; people have been telling MediaWiki that
they should be using 301s for years but they just won't. This workaround just
encourages them to never fix it.

~~~
zepolen
It's not just one example: you can 'hurt' a website simply by making a huge
list of links with arbitrary query string garbage so that Google picks it up,
e.g.:

<http://domain.com/?dupcontent>

<http://domain.com/?blabla>

What an app should really do is validate the arguments in the query string,
remove any invalid ones, then issue a 301 redirect to the proper url.

For example:

[http://www.google.com/search?q=someterm&unknown_variable...](http://www.google.com/search?q=someterm&unknown_variable=nothing)

redirects to

<http://www.google.com/search?q=someterm>

Of course that's like getting everyone's CSS to validate correctly :)

Edit: It is also why I prefer to sort my query string so that it can be
deterministic and always be the same no matter what order the args are in.
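
A rough sketch of that validate, strip, sort, and redirect step in Python. The whitelist `ALLOWED_PARAMS` and both function names are assumptions for illustration, not anything Google or a particular framework provides.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical whitelist: the query parameters this page actually understands.
ALLOWED_PARAMS = {"q", "page"}


def canonical_url(url, allowed=ALLOWED_PARAMS):
    """Drop unknown query parameters and sort the remaining ones."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    # Sorting makes the result independent of the original argument order.
    kept = sorted((k, v) for k, v in pairs if k in allowed)
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), "")
    )


def redirect_target(url):
    """Return the URL to 301 to, or None if url is already canonical."""
    target = canonical_url(url)
    return None if target == url else target
```

With that whitelist, `http://www.google.com/search?q=someterm&unknown_variable=nothing` canonicalizes to `http://www.google.com/search?q=someterm`, and both `http://domain.com/?dupcontent` and `http://domain.com/?blabla` collapse to `http://domain.com/`.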

~~~
briansmith
That kind of attack won't work on a website that does a 301 redirect from all
non-canonical URLs to the canonical ones.

------
IsaacSchlueter
Why didn't they use the already established rel=bookmark value from the hAtom
microformat? That's already in the wild on countless blogs and websites.

I swear, sometimes Google's awareness of existing web conventions is
_shockingly_ lacking.

~~~
litewulf
Maybe I'm crazy, but a canonical URL is different from an hAtom permalink.

Besides, it's a canonical URL for the whole document and not just a portion of
a page. How does Google know what the scope of a given rel=bookmark is? What
if people already use it, and using it in a slightly different way would
pollute hAtom?

~~~
IsaacSchlueter
Well, that's why you'd use a <link> tag instead of an <a> tag.

Link tags in the head are information about the whole document. Anchor tags
are more vague in their semantics, and in the context of an hAtom item, <a
rel="bookmark"> would be the link to the canonical URL for that item.

For a document, <link rel="bookmark"> would be the canonical URL for that
document. As in, "If you are looking to save or _bookmark_ this document's
URL, you should do it using this URL over here."
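
To make the <link> vs. <a> distinction concrete, here is a minimal sketch of how a consumer could read only document-level link relations, using Python's standard `html.parser`. The class and function names are hypothetical, and for brevity it does not check that the <link> actually sits inside <head>.

```python
from html.parser import HTMLParser


class LinkRelParser(HTMLParser):
    """Collect href values of <link> elements, keyed by rel token."""

    def __init__(self):
        super().__init__()
        self.links = {}

    def handle_starttag(self, tag, attrs):
        # <a rel="..."> is deliberately ignored: it describes a part of
        # the page, not the document as a whole.
        if tag != "link":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        # rel is a space-separated list of tokens; keep the first href seen.
        for rel in (attrs.get("rel") or "").lower().split():
            self.links.setdefault(rel, href)


def document_links(html):
    """Return a {rel: href} dict for the <link> elements in html."""
    parser = LinkRelParser()
    parser.feed(html)
    return parser.links
```

Whichever rel value is used (canonical or bookmark), the scope question goes away once only <link> elements are consulted.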

------
aristus
(Black hat on) This might be very interesting for some types of injection
attacks. Instead of simply getting backlinks you could steal the pagerank of
your victims without leaving a visible mark. Limited to subdomains, though.

~~~
lsb
No, it's the same domain, so unless your victims are your co-workers, it won't
work.

------
buro9
I've got a site that runs on both http and https.

Is a protocol change enough of a difference to consider the identical content
as a duplicate?

