
Why doesn't Google look at identical URL paths on the https, http, www, and no-www variants of a URL and, if they look similar, apply some default Google policy to select which of them is canonical?

For example, if http://mydomain.com/path and https://www.mydomain.com/path have 95% content correlation, repeated requests to http://mydomain.com/path also have 95% content correlation with each other, and the server headers look the same, why would it not be safe to decide those are duplicates of a single canonical URL?
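To make the idea concrete, here is a rough sketch of the kind of heuristic I mean. It is only an illustration, not how Google does anything: the function names, the use of difflib as the "content correlation" measure, and the 0.95 threshold are all my own assumptions, and fetching the pages and comparing server headers is left out.

```python
from difflib import SequenceMatcher
from itertools import combinations
from urllib.parse import urlsplit, urlunsplit


def url_variants(url):
    """Return the http/https and www/no-www variants of a URL, same path."""
    parts = urlsplit(url)
    host = parts.netloc
    # Strip a leading "www." so both host forms can be rebuilt from the bare name.
    bare = host[4:] if host.startswith("www.") else host
    return [
        urlunsplit((scheme, prefix + bare, parts.path, parts.query, ""))
        for scheme in ("http", "https")
        for prefix in ("", "www.")
    ]


def similarity(a, b):
    """Similarity ratio in [0, 1]; a stand-in for 'content correlation'."""
    return SequenceMatcher(None, a, b).ratio()


def looks_like_duplicates(bodies, threshold=0.95):
    """bodies maps variant URL -> page text (hypothetical, fetched elsewhere).

    True only if every pair of variants is at least `threshold` similar.
    """
    return all(
        similarity(a, b) >= threshold
        for a, b in combinations(bodies.values(), 2)
    )
```

You would fetch each result of url_variants("http://mydomain.com/path"), put the response bodies in a dict, and pass that to looks_like_duplicates. Repeated fetches of the same URL could go into the same dict to check that the page is stable with itself.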

It's not safe to merge www.domain1.com and www.domain2.com, and it's not safe to merge subdomain.domain.com and www.domain.com. However, for the limited cases of www vs. no-www and https vs. http, if the pages look similar, I think it's harmful not to treat them as the same site. You can't expect every website owner to be aware of this issue.

If it's a matter of not being able to be 100% sure, is there a single site that cares about its Google ranking and runs different sites on different combinations of www/no-www and https/http, yet has content similar enough to confuse a simple heuristic based on page similarity? In what sort of circumstance could that happen, other than with placeholder pages?

Google Webmaster Tools (GWT) allows selecting a preference between www and no-www, but I don't see a preference between https and http. I think Google should add a notice that using GWT to select between www and no-www is deprecated, and that the recommended way to handle www, no-www, http, and https selection is to use 301 redirects or rel="canonical" tags.
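For the 301 route, the logic on the site owner's side is small. This is a minimal sketch, assuming you have picked https plus www as the canonical form; the function name and its parameters are made up for illustration, and wiring it into a real server or framework is not shown.

```python
from urllib.parse import urlsplit, urlunsplit


def redirect_target(url, scheme="https", use_www=True):
    """Return the canonical URL to 301 to, or None if `url` is already canonical.

    `scheme` and `use_www` encode the site owner's preference (assumed here).
    """
    parts = urlsplit(url)
    host = parts.netloc.lower()
    # Normalize the host to its bare form, then add "www." back if wanted.
    bare = host[4:] if host.startswith("www.") else host
    wanted_host = "www." + bare if use_www else bare
    target = urlunsplit(
        (scheme, wanted_host, parts.path, parts.query, parts.fragment)
    )
    return None if target == url else target
```

A request handler would call this with the requested URL and, when it returns a value, respond with status 301 and a Location header set to that value. The rel="canonical" alternative puts the same target URL in a link tag in the page head instead.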
