
How to make fun of Google Bot (PicoLisp Wiki) - markokocic
http://picolisp.com/5000/-2-1i.html
======
skrebbel
I once made a similar (but less real-language-like) site to fool spambots on
my now-defunct web consultancy's page (<http://www.resolution.nl/food> if you
care). The idea was that a crawler scraping the internet for email addresses
to spam would fill its DB with bogus addresses, after which, hopefully, the
spammer would dump that day's results in annoyance, including our real email
address. I never figured out whether that really worked, but it was fun to
make.

What did work, however, was fooling a searchbot: the whole thing got me a very
angry mail from a Dutch search engine team (ilse.nl) whose bot had been stuck
on it for an entire day. I had no robots.txt (I didn't even know what it was),
which the search engine team decided was a really nasty breach of netiquette.

~~~
troels
_The whole thing got me a very angry mail from a Dutch search engine team
(ilse.nl) whose bot had been stuck on it for an entire day._

So, somehow it was your fault that their ill-coded bot got stuck? It's not
like you forced them to crawl your site.

~~~
skrebbel
Correct, which is why I laughed.

------
bauchidgw
See here for the spec of how robots.txt is parsed:

[http://code.google.com/web/controlcrawlindex/docs/robots_txt...](http://code.google.com/web/controlcrawlindex/docs/robots_txt.html)

The robots.txt of <http://picolisp.com> (found at
<http://picolisp.com/robots.txt>) allows the indexing of
<http://picolisp.com/21000> and all follow-up pages.

Why? See the spec:

_The disallow directive specifies paths that must not be accessed by the
designated crawlers. When no path is specified, the directive is ignored._

See <https://github.com/franzenzenhofer/robotstxt> for a CoffeeScript
implementation of a robots.txt parser.
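
As a rough illustration of that rule, here is a minimal sketch using Python's
standard urllib.robotparser (an approximation only, not the parser Googlebot
or the project above actually uses). An empty Disallow value is ignored, so
everything stays crawlable; a non-empty path blocks that prefix:

    import urllib.robotparser

    def allowed(robots_lines, url, agent="Googlebot"):
        # Build a fresh parser for each robots.txt variant.
        rp = urllib.robotparser.RobotFileParser()
        rp.parse(robots_lines)
        return rp.can_fetch(agent, url)

    # Empty Disallow: the directive is ignored, so everything is allowed.
    print(allowed(["User-Agent: *", "Disallow:"],
                  "http://picolisp.com/21000"))          # True

    # Non-empty Disallow: the path prefix is blocked.
    print(allowed(["User-Agent: *", "Disallow: /21000"],
                  "http://picolisp.com/21000"))          # False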

~~~
budgi3
So what should his robots.txt look like? At the moment it is:

    User-Agent: *
    Disallow: /21000/

~~~
saalweachter
It's _mostly_ sufficient. "/21000/" will not match <http://picolisp.com/21000>,
which is the first URL in the sequence, but the remaining URLs look like
<http://picolisp.com/21000/!start?*Page=+2>, so Googlebot will likely continue
to download only a single page once it has re-read the robots.txt.

Which is what you deserve for using non-standard URL formats.

~~~
Florin_Andrei
Hold on, slash at the end is not standard?

~~~
saalweachter
No, I'm saying /21000/ will match a path with a directory named /21000 but not
a file named /21000.

When I say "non-standard", I am saying that if the website's URLs looked like
"/21000/foo" and "/21000/foo?page=2", it would have been easier to craft a
"Disallow" rule that successfully blocked all of the desired pages.
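
For what it's worth, a minimal sketch with Python's standard urllib.robotparser
(only an approximation of Googlebot's actual matcher) shows the same prefix
behaviour for the current rule:

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.parse([
        "User-Agent: *",
        "Disallow: /21000/",  # trailing slash: only matches paths under /21000/
    ])

    # The first page has no trailing slash, so the rule does not apply.
    print(rp.can_fetch("Googlebot", "http://picolisp.com/21000"))  # True

    # Follow-up pages live under /21000/, so they are blocked.
    print(rp.can_fetch("Googlebot",
                       "http://picolisp.com/21000/!start?*Page=+2"))  # False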

------
Matt_Cutts
The vast majority of the time when I see a complaint like this about
robots.txt, it's because the site has missed a character here or there, or
because they're not putting the robots.txt file on the right hostname.

Google has a free robots.txt checker that lets you test your robots.txt
files. Given a robots.txt file, you can enter specific URLs and check whether
each would be blocked or not. Here's a link for more info on that free tool:
[http://www.google.com/support/webmasters/bin/answer.py?hl=en...](http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=156449&from=35237&rd=1)

------
coverband
Another incorrect assumption in the article is that every bot with the Google
UA originates from Google. There are plenty of other (often malicious) bot
sources that simply copy Google's signature to make themselves less obvious.
He needs to check the IP block to make sure the bot with the observed behavior
really came from Google.
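
One way to do that is the reverse-then-forward DNS check commonly used to
verify Googlebot: resolve the IP to a hostname, confirm it ends in
googlebot.com or google.com, then resolve that hostname back and confirm it
maps to the original IP. A minimal sketch in Python (the helper name is just
illustrative; the example IP is the one mentioned in a comment below):

    import socket

    def is_real_googlebot(ip):
        # Reverse DNS: real Googlebot IPs resolve to crawl-*.googlebot.com names.
        try:
            host = socket.gethostbyaddr(ip)[0]
        except socket.herror:
            return False
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # Forward DNS: the hostname must resolve back to the original IP.
        try:
            return ip in socket.gethostbyname_ex(host)[2]
        except socket.gaierror:
            return False

    print(is_real_googlebot("66.249.71.203"))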

~~~
rodion_89
I just took a look at the last offending Googlebot IP and it seems to
originate from Google.

<http://www.ip-adress.com/ip_tracer/66.249.71.203>

------
adpowers
That reminds me of this page in which the author created a large binary tree
of pages and watched how various crawlers walked the tree.

<http://www.drunkmenworkhere.org/219>

------
j_col
Very interesting experiment, and surprising that the Google bot appears (in
this instance at least) to be ignoring robots.txt.

~~~
saalweachter
The problem is not Googlebot.

The robots.txt on ticker.picolisp.com says "Disallow: /", but
ticker.picolisp.com redirects to picolisp.com/21000, and the robots.txt on
picolisp.com says only "Disallow:". If he wants Googlebot to stop crawling
those URLs, he needs to add "Disallow: /21000" to picolisp.com's robots.txt.

~~~
j_col
Hmmmm, maybe he just did: <http://picolisp.com/robots.txt>

------
sygeek
Dumb bots are dumb

