

Amazon Web Services: Clouded by Duplicate Content - grep
http://www.seomoz.org/blog/amazon-web-services-creator-of-mass-duplicate-content

======
jauer
Wouldn't it be simpler to use VirtualHost so you only respond with content to
requests for your domain?

Then set it up so requests without the domain name get a 301 redirect to the
canonical URL.
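
Something like this, for instance - a minimal sketch where www.example.com stands in for the real domain:

    # Apache 2.2: the first vhost listed is the default, so it catches any
    # request whose Host header doesn't match a named vhost (bare IPs,
    # amazonaws.com names, etc.) and 301s it to the canonical hostname.
    NameVirtualHost *:80

    <VirtualHost *:80>
        ServerName default.invalid
        RedirectPermanent / http://www.example.com/
    </VirtualHost>

    <VirtualHost *:80>
        ServerName www.example.com
        DocumentRoot /var/www/example
    </VirtualHost>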

~~~
blantonl
On Apache, SSL and name-based VirtualHosts don't play together.

~~~
dhess
They do with Apache 2.2.12 or later and any recent browser:

<http://wiki.apache.org/httpd/NameBasedSSLVHostsWithSNI>

<http://en.wikipedia.org/wiki/Server_Name_Indication>

I'm using SNI on all of my domains, and it works great.
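
For reference, a minimal sketch of what that looks like - hostnames and certificate paths are placeholders:

    # Name-based SSL vhosts via SNI (Apache 2.2.12+, OpenSSL 0.9.8f+).
    NameVirtualHost *:443

    <VirtualHost *:443>
        ServerName www.example.com
        SSLEngine on
        SSLCertificateFile    /etc/ssl/example.com.crt
        SSLCertificateKeyFile /etc/ssl/example.com.key
    </VirtualHost>

    <VirtualHost *:443>
        ServerName www.example.org
        SSLEngine on
        SSLCertificateFile    /etc/ssl/example.org.crt
        SSLCertificateKeyFile /etc/ssl/example.org.key
    </VirtualHost>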

------
akirk
I don't quite understand why this article doesn't recommend using <link rel="canonical" href="..."> as described at <http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html> (and, for the cross-domain case, <http://googlewebmastercentral.blogspot.com/2009/12/handling-legitimate-cross-domain.html>).

Such an easy solution to this problem.
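
The tag itself is a one-liner in each page's <head> (the URL here is a placeholder):

    <link rel="canonical" href="http://www.example.com/page.html" />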

~~~
wmf
The suggested rewrite rule is even simpler than putting a <link> tag in every
page.
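
Roughly this shape, in mod_rewrite terms (the amazonaws pattern and the canonical hostname are assumptions, not the article's exact rule):

    RewriteEngine On
    # Any request arriving under an *.amazonaws.com hostname gets a
    # permanent redirect to the same path on the canonical domain.
    RewriteCond %{HTTP_HOST} \.amazonaws\.com$ [NC]
    RewriteRule ^/?(.*)$ http://www.example.com/$1 [R=301,L]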

------
bdb
Sorry, but the author is disqualified by this sentence:

"Now there were no external links to these AWS subdomains but, being a domain
registrar, Google was notified of the new DNS entries and went ahead and
indexed loads of pages."

~~~
gxti
To clarify:

* Google isn't a domain registrar. If they were, then one would hope that google.com would be registered under them, not markmonitor.com (a popular corporate registrar)

* The server hostname is a subdomain of amazonaws.com and doesn't require registration.

* In fact, the entire name is autogenerated from the IP address using a simple substitution rule (e.g. an instance at 174.129.1.2 gets ec2-174-129-1-2.compute-1.amazonaws.com).

* I'm not sure how Google finds these, because their own "what links here" doesn't turn anything up.

* Side note: while searching for backlinks in Google, I noticed that this article seems to have been copypasta'd to a dozen other blogs. That's SEO, baby!

EDIT: Damnit HN, you look like reddit, why can't you use a sane text
preprocessor like reddit's?

~~~
jonknee
... Google is a registrar and has been since 2005.

<http://news.netcraft.com/archives/2005/01/31/google_is_now_a_domain_registrar.html>

<http://www.icann.org/en/registrars/accredited-list.html>

~~~
gxti
Thanks for correcting me; I've never used Google Apps, so I'd never noticed
you could register domains there. Personally I'd still register my own
domains elsewhere, as I've had bad experiences with combined registration and
hosting, and even the almighty Google has been known to partake in
shenanigans.

In any case, I'm pretty sure you don't even have to be a registrar to get
lists of newly registered domains. Not that it matters, because the rest of
my post explains why registration has nothing to do with this.

------
dedward
"Now there were no external links to these AWS subdomains but, being a domain
registrar, Google was notified of the new DNS entries and went ahead and
indexed loads of pages"

Domain registrars wouldn't be notified of new RRs inside a second-level
domain - that would be pointless.

I can't see any way they would ever index a URL that used a DNS RR that was
brand new. I'd hazard a guess that either the URL was used previously within
the cloud and published somewhere, it was set up as a CNAME in your own DNS,
or your main webserver returned it in a response to Googlebot in some fashion
at some point.

------
madssj
I think we would all be better off just using an Elastic IP address, and not
using the dynamic address for public websites.

Also, the same problem applies to normal servers where the webserver is
configured to serve the website for the bare IP address, kind of like:

<http://174.132.225.106/>

which google also has picked up:

<http://www.google.dk/search?q=site:174.132.225.106>

------
bkrausz
Every website should have a similar redirect rule in there somewhere (I
implement it in PHP). If someone hits yoursite.com, you probably want to
redirect them to www.yoursite.com. I whitelist my domains, so that anything
that points to my server and isn't a valid subdomain gets redirected to www.
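
In Apache config rather than PHP, the same whitelist idea looks roughly like this (the subdomain list and domain are placeholders):

    RewriteEngine On
    # Anything that isn't one of the whitelisted hostnames gets 301'd to www.
    RewriteCond %{HTTP_HOST} !^(www|blog|mail)\.yoursite\.com$ [NC]
    RewriteRule ^/?(.*)$ http://www.yoursite.com/$1 [R=301,L]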

------
joshu
Horseshit. Learn to configure your webserver.

------
rlpb
If accessing your web server via *.amazonaws.com does not make sense for you,
why not just block (whether 403 or 404) all HTTP requests with a Host:
*.amazonaws.com header, rather than messing around with rewrites and
robots.txt?
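
For example, as a mod_rewrite sketch ([F] answers with a 403):

    RewriteEngine On
    # Refuse anything addressed to us via an amazonaws.com hostname.
    RewriteCond %{HTTP_HOST} \.amazonaws\.com$ [NC]
    RewriteRule .* - [F]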

~~~
ddemchuk
Because once that content is indexed, you can receive a negative effect from
Google by mass-403ing a huge amount of content.

~~~
rlpb
301 it then, if this applies to you. It's still the same solution.

~~~
ddemchuk
No, it's not. Google has said that in situations like this you should use a
301 redirect to properly tell them where the actual content is located. A 403
is an error response that tells the spiders the page doesn't exist anymore,
and that can have a negative impact on your rankings.

