

Google tries to crawl pages that might exist - parham

Googlebot has recently started crawling seemingly random pages, especially profile and news pages, e.g. /profile/<random_name> or /news/<random_id>.

Neither the profile nor the news path prefix is used in my site's URLs.

Some example random names were “BobHope” and “suitcase_murphy”; these aren't included in or related to the site in question at all.

I'm not complaining, this is just an observation. Has this happened to anyone else?
======
ancarda
I downloaded all my logs from a public-facing development server. Hitting
anything returns "403 Forbidden" except for a few domains which should not be
crawled (Google does it anyway). Most traffic is the usual (hit / and
/robots.txt, then leave). All I found (in 41 rotated logs) was:

    
    
        access.log.12:66.249.73.132 - - [29/Oct/2013:01:26:15 +0100] "GET /?ac=2 HTTP/1.1" 403 135 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
        access.log.12:66.249.73.132 - - [29/Oct/2013:03:24:12 +0100] "GET /?tag=lazy HTTP/1.1" 403 135 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
        access.log.3:66.249.66.57 - - [06/Nov/2013:12:12:42 +0100] "GET /?ac=2&slt=8&slr=1&lpt=1 HTTP/1.1" 403 135 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
        access.log.4:66.249.75.57 - - [05/Nov/2013:19:14:19 +0100] "GET /?cat=1 HTTP/1.1" 403 135 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
        access.log.5:66.249.66.109 - - [04/Nov/2013:15:54:01 +0100] "GET /?ac=2&slt=8&slr=1&lpt=1 HTTP/1.1" 403 135 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
        access.log.5:66.249.75.18 - - [05/Nov/2013:03:08:27 +0100] "GET /?ac=2&slt=8&slr=1&lpt=1 HTTP/1.1" 403 135 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    

I have no idea what "?ac", "?cat" or "?tag" are supposed to do. Nothing on the
server responds to GET params (I use url rewrites and POST only) so I don't
think I ever made a link (not even accidentally).

I found nothing for "/profile" or "/news"
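
For anyone who wants to run the same kind of search over their own logs, here is a minimal sketch. It assumes lines in the Apache/nginx combined log format, optionally prefixed with `filename:` the way grep prints them (as in the excerpt above). The names `LINE_RE` and `googlebot_query_hits` are made up for this example. It only matches on the user-agent string, which anyone can spoof, so it does not prove a request really came from Google.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Combined log format, with an optional "filename:" prefix as printed by grep.
# Assumption: client addresses are IPv4, so the first colon (if any) belongs
# to the filename prefix.
LINE_RE = re.compile(
    r'^(?:(?P<file>[^:\s]+):)?(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_query_hits(lines):
    """Yield (ip, path, params) for requests whose user agent mentions
    Googlebot and whose request target carries a query string."""
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match or "Googlebot" not in match.group("agent"):
            continue
        parts = urlsplit(match.group("target"))
        if parts.query:
            yield match.group("ip"), parts.path, parse_qs(parts.query)


# Two lines copied from the excerpt above.
SAMPLE = [
    'access.log.12:66.249.73.132 - - [29/Oct/2013:01:26:15 +0100] '
    '"GET /?ac=2 HTTP/1.1" 403 135 "-" "Mozilla/5.0 (compatible; '
    'Googlebot/2.1; +http://www.google.com/bot.html)"',
    'access.log.4:66.249.75.57 - - [05/Nov/2013:19:14:19 +0100] '
    '"GET /?cat=1 HTTP/1.1" 403 135 "-" "Mozilla/5.0 (compatible; '
    'Googlebot/2.1; +http://www.google.com/bot.html)"',
]

if __name__ == "__main__":
    for ip, path, params in googlebot_query_hits(SAMPLE):
        print(ip, path, sorted(params))
```

Grouping the results by parameter name (here `ac` and `cat`) makes it easier to compare them against the link structure of whatever previously lived on the same address.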

~~~
andrewcooke
i've found an example of a similar url -
[http://stackoverflow.com/questions/19674788/redirect-url-hav...](http://stackoverflow.com/questions/19674788/redirect-url-having-query-string-to-404-page-using-htaccess) -
so i am wondering if you have an ip that
previously belonged to someone using whatever software that is (some kind of
index or pagination i assume) and which had links with a numeric ip
address....

~~~
parham
Oh wow, I didn't think of that. That is the reason: the IP address I'm using
is an Elastic IP from AWS that I'm reusing from an old project, and that
project had those exact same links.

------
benologist
The chance of either of those URLs being present on a site would have to be
minuscule. Most likely they followed some outdated link somewhere.

~~~
parham
Very unlikely, as it's a development server (I should've mentioned that). I
doubt anyone linked back.

~~~
benologist
Here are two simple ways links could have happened:

- someone having the domain before you

- script kiddies machine-generating links to pages on random domains so
Google finds exploitable targets for them

Google is not randomly looking for pages - that's brute-forcing a site map and
it's utterly, impossibly inefficient even to supplement what they find
crawling around.
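
To put a rough number on that inefficiency, here is a back-of-the-envelope sketch. The alphabet (letters, digits, underscore) and the 15-character limit are assumptions picked only because "suitcase_murphy" is 15 characters long; `candidate_count` is a made-up helper name.

```python
import string


def candidate_count(alphabet_size, max_length):
    """Number of distinct strings of length 1..max_length over the alphabet."""
    return sum(alphabet_size ** n for n in range(1, max_length + 1))


# Assumed alphabet: a-z, A-Z, 0-9 and "_" (63 symbols).
ALPHABET = string.ascii_letters + string.digits + "_"

if __name__ == "__main__":
    total = candidate_count(len(ALPHABET), 15)
    # Python integers are arbitrary precision, so this prints the exact count.
    print(f"{total:.3e} candidate names for a single path prefix")
```

Under those assumptions the count comes out at roughly 10^27 candidates for one path prefix on one site, which is why guessing URLs blindly is not a plausible crawl strategy.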

------
throwaway420
Possibly they're checking random URLs to see if a site is erroneously
redirecting what should be 404s to random content pages?

~~~
andrewcooke
if you were going to write that code, would you use a url like that or a long,
random string that you know will not exist? and why wouldn't you add at the
end "this will not exist don't worry"?

