
Ask HN: So dunno how to reach Google but I think their bot is broken - emilfihlman
So I was doing some webdev and playing around with my projects. Got a 404 and decided to check my Nginx logs and this stared back at me, just scrolling by

SNIP

      66.249.64.76 - - [04/Feb/2018:16:09:38 +0200] "GET /manual-motor-scania-dc12-53a-pdf-download.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      66.249.64.76 - - [04/Feb/2018:16:09:38 +0200] "GET /15-incredible-uses-for-leftover-cake-the-craftsy-blog.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      66.249.64.76 - - [04/Feb/2018:16:09:38 +0200] "GET /tab-top-curtains-drapes-youll-love-wayfair.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      66.249.64.76 - - [04/Feb/2018:16:09:38 +0200] "GET /en-dwg-pdf.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      66.249.64.74 - - [04/Feb/2018:16:09:38 +0200] "GET /the-fda-top-ten-pdf.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      66.249.64.74 - - [04/Feb/2018:16:09:38 +0200] "GET /philips-fw-c785-user-manual-pdf-download.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"

SNAP

I have no idea how long this has been going on but I have already over 40MB of this in just one log file starting at 06 in the morning. The IPs do match Google's, too.
I mean, I did ask my site to be indexed but these don't even seem like something you'd index but something you'd use as search terms!

Did someone break a config? Am I just a n00b? Am I under an attack from a l33t h4x0r? What the heck is going on?

Find out in the next epis^Wfew hours of: Ask HN!
======
luos
It's possible that your site's IP was previously used for some other sites and
was indexed as [http://IP/..](http://IP/..). Google will now keep retrying
those URLs for a while.

Or maybe someone else's domain is resolving to your IP. It looks to me like
nginx is routing those requests to the default site in the config.

~~~
otp124
I think it is likely the first option (site IP was previously used), as I
searched for some of those paths and they look to be spread across different
domains. Perhaps a shared host with many domains pointing to the same IP?

~~~
emilfihlman
I've had the same IP for 4 years. I think it's unlikely that Google started
scanning after 4 years.

------
f_allwein
Are you saying Googlebot is accessing these URLs on your domain, but they
should not/do not exist? One possibility is that your site was hacked and
someone added these URLs for spammy purposes. Do a [site:yourdomain.com]
search just to be sure.

And while you can't reach Google directly, you should be able to get more
expert support in their Webmaster Help forum:
[https://productforums.google.com/forum/#!forum/webmasters](https://productforums.google.com/forum/#!forum/webmasters)

~~~
emilfihlman
That's exactly what I'm saying. I'm rather confident that there's no one
inside my machine.

And I mean, it doesn't affect me at all, I'm just curious why it's happening!

------
borplk
I don't know exactly what's going on but it may have something to do with
referrer spam.

Those URLs may have appeared somewhere else or in some other way fed into
Googlebot.

~~~
emilfihlman
Yeah, and they are valid documents, at least the few I checked.

The interesting question is: why me, and why isn't Google detecting such
silliness?

------
tedmiston
May or may not be related but I've seen similar looking spam traffic in my
Google Analytics accounts historically.

[https://www.optimizesmart.com/geek-guide-removing-referrer-s...](https://www.optimizesmart.com/geek-guide-removing-referrer-spam-google-analytics/)

It might be a spammer spoofing the Googlebot user agent; the IP looks legit
though.

Do you have a sitemap or robots.txt on the site?

~~~
emilfihlman
Yeah the first idea was of course that someone is spamming me but:

Like you said the IPs seem legit

And it's just a personal site, no ads or paid content or services.

No sitemap or robots.

------
gildas
I don't know the cause of the issue but searching for these documents in a
search engine may help. I found this spammy page for example
[http://cat2.inqbaytor.com.au](http://cat2.inqbaytor.com.au).

------
geoah
Log a full request and check the Host header. That will tell you what domain
they are hitting, and you can check that domain's DNS. It will probably be
pointing to your IP.
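
A rough sketch of that check in Python, in case it helps. It assumes you have captured one raw request as text (for example with tcpdump or a debug log); the function names `host_header` and `points_at` are made up for this example, and the resolver is passed in as a parameter so it can be swapped out. Ports in the Host value and IPv6 are not handled.

```python
import socket
from typing import Optional


def host_header(raw_request: str) -> Optional[str]:
    """Return the Host header value from a raw HTTP/1.x request, if present."""
    lines = raw_request.split("\r\n")
    for line in lines[1:]:            # skip the request line
        if line == "":                # a blank line ends the header section
            break
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            return value.strip()
    return None


def points_at(hostname: str, my_ip: str, resolve=socket.gethostbyname_ex) -> bool:
    """True if one of the hostname's IPv4 addresses equals my_ip."""
    try:
        _, _, addresses = resolve(hostname)
    except OSError:                   # covers socket.gaierror / socket.herror
        return False
    return my_ip in addresses
```

If `host_header` returns some domain you have never heard of and `points_at(that_domain, your_ip)` is true, then somebody else's DNS record is sending the crawler your way.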

------
packetized
Are you on Dreamhost?
[http://automodelista.com.au/](http://automodelista.com.au/) ?

~~~
emilfihlman
Nope, but it is from a VPS provider, though one I've had for many years.

~~~
packetized
Changed IPs recently?

~~~
emilfihlman
Nope, as far as I know I've had the same IP for years.

------
blakesterz
Are you running WordPress? At some point was it hacked? Is it hacked now? If
it was at some point, maybe those were all real pages in the past. If it is
now, maybe the bad guys just changed what they were doing. I've seen pages
like this generated in hacked WordPress sites.

~~~
emilfihlman
Nope, never had WP on my machine.

------
dmix
Any spammer/automated URL scanner can set their user agent to "Googlebot".
Doesn't mean it's Googlebot.
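
For what it's worth, the usual way to tell the two apart is a reverse DNS lookup on the client IP followed by a forward lookup on the name it returns. Here is a minimal Python sketch of that idea. The function name `is_real_googlebot` and the suffix list are assumptions for this example (check Google's current documentation for the domains it actually uses), and the resolvers are parameters so they can be replaced.

```python
import socket

# Assumed list of domains a genuine crawler hostname should end with.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_real_googlebot(ip: str,
                      reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex) -> bool:
    """Reverse-resolve ip, check the domain, then forward-resolve to confirm."""
    try:
        hostname, _, _ = reverse(ip)
        if not hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
            return False
        _, _, addresses = forward(hostname)
    except OSError:                   # lookup failed: treat as not verified
        return False
    # The forward lookup must lead back to the same IP, otherwise the
    # PTR record could simply be forged by whoever owns the address block.
    return ip in addresses
```

The user agent string never enters into it, which is the point: only the IP seen by the server is checked.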

~~~
yeukhon
Not really. A spoofed IP will fail during the TCP handshake. Of course, unless
OP configured the log file format so that Nginx logs something other than the
real IP address. If your user hits your load balancer/proxy server (e.g.
Nginx) and you then forward that HTTP request to your backend, the backend
sees the IP address of your load balancer/proxy server, but conventionally you
can set the X-Forwarded-For header, and the backend can then read
X-Forwarded-For and "trust" that as the original IP.

But this header can be set to whatever value you want if you control the load
balancer/proxy server. You can also

~~~
emilfihlman
Default Nginx log format so the ip should be the real deal, at least according
to Nginx. No load balancer or proxy is used.

------
emilfihlman
As an update: still getting the traffic

------
emilfihlman
A larger sample

    
    
      66.249.64.74 - - [04/Feb/2018:15:46:13 +0200] "GET /letters-papers-from-prison-pdf.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      66.249.64.74 - - [04/Feb/2018:15:46:13 +0200] "GET /xerox-phaser-8560-service-manual-pdf-download.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      66.249.64.74 - - [04/Feb/2018:15:46:14 +0200] "GET /syllabus-of-basic-education-2017-statistics-and.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      66.249.64.74 - - [04/Feb/2018:15:46:14 +0200] "GET /challenge-champion-305-tc-paper-cutter.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      66.249.64.74 - - [04/Feb/2018:15:46:14 +0200] "GET /frost-nixon-wikipedia.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      66.249.64.74 - - [04/Feb/2018:15:46:14 +0200] "GET /dremel-multi-max-3-3-amp-corded-oscillating-multi-tool.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      66.249.64.74 - - [04/Feb/2018:15:46:14 +0200] "GET /the-effects-of-fabrication-methods-and-cure-cycle-on-the.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      66.249.64.74 - - [04/Feb/2018:15:46:14 +0200] "GET /common-paint-problems-with-solutions-and-preventions.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      66.249.64.74 - - [04/Feb/2018:15:46:14 +0200] "GET /introduction-to-criminal-justice-university-of-texas-at-dallas.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      66.249.64.74 - - [04/Feb/2018:15:46:15 +0200] "GET /blank-business-card-template-free-premium-templates.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      66.249.64.74 - - [04/Feb/2018:15:46:15 +0200] "GET /performing-arts-bursaries-in-south-africa-2014-full-download.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      66.249.64.74 - - [04/Feb/2018:15:46:15 +0200] "GET /the-light-of-the-world-sermons4kids-com.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      66.249.64.74 - - [04/Feb/2018:15:46:15 +0200] "GET /sccm-ecmo-management-download-pdf.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      66.249.64.74 - - [04/Feb/2018:15:46:15 +0200] "GET /k-rcher-1-516-260-0-dampfreiniger-sc-1-amazon-de-baumarkt.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      66.249.64.74 - - [04/Feb/2018:15:46:15 +0200] "GET /career-coaching-career-counselling-services-barnes-london.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      66.249.64.74 - - [04/Feb/2018:15:46:15 +0200] "GET /panasonic-viera-tc-p42s30-service-repair-guide-pdf-download.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      66.249.64.74 - - [04/Feb/2018:15:46:16 +0200] "GET /symptom-checker-check-your-pdf.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      66.249.64.74 - - [04/Feb/2018:15:46:16 +0200] "GET /eny-228-ig124-springtails-pdf.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      66.249.64.74 - - [04/Feb/2018:15:46:16 +0200] "GET /empirical-research-explorable.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      66.249.64.74 - - [04/Feb/2018:15:46:16 +0200] "GET /baby-whale-nursery-decor-pdf.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      66.249.64.74 - - [04/Feb/2018:15:46:16 +0200] "GET /kubota-b26-shop-manual-pdf.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      66.249.64.74 - - [04/Feb/2018:15:46:16 +0200] "GET /best-online-genealogy-service-myheritage-vs-ancestry-vs.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      66.249.64.74 - - [04/Feb/2018:15:46:17 +0200] "GET /advanced-quantitative-precipitation-information.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      66.249.64.74 - - [04/Feb/2018:15:46:17 +0200] "GET /international-phonetic-alphabet-pdf-file-scholarly-search.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      66.249.64.74 - - [04/Feb/2018:15:46:17 +0200] "GET /developmental-disturbances-in-teeth-srm-university.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      66.249.64.74 - - [04/Feb/2018:15:46:17 +0200] "GET /viper-5706v-pdf-manual-download-for-free.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      66.249.64.74 - - [04/Feb/2018:15:46:17 +0200] "GET /1-18-27-1-loi-du-1er-ao-t-1987-nationalit-livre-1er-du.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      66.249.64.74 - - [04/Feb/2018:15:46:17 +0200] "GET /skybrary-aircraft-performance-pdf.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      66.249.64.74 - - [04/Feb/2018:15:46:17 +0200] "GET /richmond-va-office-of-budget-and-strategic-planning-home.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      66.249.64.74 - - [04/Feb/2018:15:46:18 +0200] "GET /exploring-aromatics-academy-of-nutrition-and-dietetics.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      66.249.64.74 - - [04/Feb/2018:15:46:18 +0200] "GET /proposed-four-storey-public-city-library-scribd.pdf HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
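
With tens of megabytes of this, counting is easier than scrolling. Below is a small Python sketch that tallies requests per client IP and status code for lines shaped like the sample above. The regex is only fitted to this sample, and the names `LINE_RE` and `summarize` are made up for this example.

```python
import re
from collections import Counter

# Fitted to the sample lines above: ip, two dashes, [time], "request",
# status, size, "referer", "user agent". Anything after that is ignored.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def summarize(lines):
    """Count requests per (client IP, HTTP status); skip unparseable lines."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue
        counts[(m["ip"], int(m["status"]))] += 1
    return counts
```

Passing an open log file to `summarize` and printing `most_common(10)` on the result should show whether nearly all of the 404s come from the two crawler addresses.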

