
Wikipedia's robots.txt - tosh
https://en.wikipedia.org/robots.txt
======
jedberg
This seems unnecessarily complicated and antagonistic. Mostly just to publicly
shame a bunch of people.

At Reddit we originally blocked a couple of crawlers but then realized how
pointless that was. The entire robots file[0] is now basically just for
Google. All of the restrictions are enforced on the server side, because
there were so many bad bots that it didn't matter whether we listed them or
not.

[0] [https://www.reddit.com/robots.txt](https://www.reddit.com/robots.txt)
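
For what it's worth, server-side enforcement can be as simple as rejecting
requests by user agent before they reach the application. A minimal Python
sketch (not Reddit's actual setup; the blocked substrings are made-up
examples):

    # WSGI middleware that rejects known-bad crawlers no matter what
    # robots.txt says. The user-agent substrings are hypothetical.
    BLOCKED_UA_SUBSTRINGS = ("badbot", "evilcrawler", "mj12bot")

    def block_bad_bots(app):
        def middleware(environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "").lower()
            if any(s in ua for s in BLOCKED_UA_SUBSTRINGS):
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Forbidden\n"]
            return app(environ, start_response)
        return middleware

In practice you'd match on IP ranges too, since bad bots routinely lie about
their user agent.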

~~~
wybiral
GitHub has an interesting entry in theirs [0]. The only user in the
Disallow section is "/ekansa".

He explained it a bit in a Twitter thread [1] where I brought it up a while
back; it sounds like GitHub didn't like that he was using the site to host
XML files for another service, because of the crawler traffic it created.

[0] [https://github.com/robots.txt](https://github.com/robots.txt)

[1]
[https://twitter.com/ekansa/status/1137052076062650368](https://twitter.com/ekansa/status/1137052076062650368)

~~~
philpem
    User-agent: *
    Allow: /humans.txt
    Disallow: /

Nice!

------
girst

        # Observed spamming large amounts of https://en.wikipedia.org/?curid=NNNNNN
        # and ignoring 429 ratelimit responses, claims to respect robots:
        # http://mj12bot.com/
        User-agent: MJ12bot
        Disallow: /
    

Coincidentally, I read more negative things about MJ12bot just last week:
[http://boston.conman.org/2019/07/09.1](http://boston.conman.org/2019/07/09.1)

~~~
spc476
You can read the rest of my MJ12Bot saga:
[http://boston.conman.org/2019/07/09-12](http://boston.conman.org/2019/07/09-12)
My take: they are grossly incompetent at programming.

~~~
syxun
The best robots.txt for Majestic:

    iptables -A INPUT -s 207.244.157.10/32 -j DROP

~~~
spc476
If only it were that easy. Last month MJ12Bot hit my site from 136 distinct
IP addresses. If we drop the last octet, that's 120 unique class-C networks;
drop the last two octets and it's 43 unique class-B networks (and, why not,
31 distinct class-A networks). It's a distributed bot and very hard to
block, so I think I came out ahead by them no longer spidering my site.

Edit: Added count of class-A blocks.
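
The prefix counting above is easy to reproduce from a log. A quick Python
sketch, assuming a file of offending IPv4 addresses (the filename is
hypothetical):

    # Count the distinct /24 (class-C), /16 (class-B), and /8 (class-A)
    # prefixes spanned by a list of IPv4 addresses, one per line.
    with open("mj12bot_ips.txt") as f:
        ips = {line.strip() for line in f if line.strip()}

    for octets, label in ((3, "class-C"), (2, "class-B"), (1, "class-A")):
        prefixes = {".".join(ip.split(".")[:octets]) for ip in ips}
        print(len(prefixes), "distinct", label, "networks")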

------
akrulino
Wikipedia's rationale for excluding articles-for-deletion (XfD) pages is
interesting:

    # Folks get annoyed when XfD discussions end up the number 1 google hit for their name.

~~~
greglindahl
There are a lot of webpages with my name on them, and the Cuil search engine
put the XfD discussion about deleting my Wikipedia page (because I'm not
significant enough) on the first page of Cuil's results for my name. I was
thrilled :-D

------
ungzd
> User-agent: TeleportPro

I remember using it to download whole (small) websites on dialup and then
read them offline.

~~~
striking
In Wikipedia's case, they'd prefer you download a dump or use an app
specifically designed to read Wikipedia offline. Rendering all that PHP is (or
at least at one point was) expensive.

~~~
shakyshakyshaky
Reference for the above comment:

[https://meta.wikimedia.org/wiki/Data_dump_torrents](https://meta.wikimedia.org/wiki/Data_dump_torrents)

I, somewhat fancifully, keep two flash drives with a Wikipedia data dump on
them, "just in case".

------
mtnGoat
I haven't put up a robots.txt in years; they are utterly pointless, IMHO.
Even Google doesn't honor them (nor have any of the others, like Alexa, for
many years now). Read your web server logs and you'll see Googlebot crawling
pages it wouldn't touch if it were honoring your robots.txt.

~~~
spc476
Have you checked the IPs to make sure they're actually Google? There's a
strong incentive to fake being Google.
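
For reference, the documented way to verify Googlebot is a reverse DNS
lookup on the IP followed by a forward lookup to confirm the name resolves
back to the same address. A minimal Python sketch:

    import socket

    def is_real_googlebot(ip: str) -> bool:
        # Reverse DNS must land in googlebot.com or google.com, and the
        # hostname must forward-resolve back to the same IP.
        try:
            host = socket.gethostbyaddr(ip)[0]
            if not host.endswith((".googlebot.com", ".google.com")):
                return False
            return ip in socket.gethostbyname_ex(host)[2]
        except OSError:
            return False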

~~~
mtnGoat
Yes, the IPs were owned by Google, and they even indexed pages they were
asked not to. :x

------
user9383781
I've always thought something like robots.txt was a bit silly when it's so
easy to ignore.

~~~
the8472
robots.txt is the sheet with the house rules posted on the wall (or part of
them), not the enforcement of those rules.

~~~
trynewideas
Part of my monthly maintenance on an independent MediaWiki install is to
cross-reference our robots.txt (which is based on Wikipedia's) against the
server logs.

If a client or IP range is misbehaving in the server logs, it goes into
robots.txt. If it's ignoring robots.txt, it gets added to the firewall's deny
list.

I've tried to automate that process a few times but have never gotten far.
It's unending, though. It feels like all it takes is a handful of cash and a
few days to start an SEO marketing-buzzword company with its own crawler,
all of it building yet another thing for us to block.
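
The cross-referencing step lends itself to scripting with the standard
library's robots.txt parser. A rough Python sketch, assuming a
combined-format access log (the URLs and paths are hypothetical):

    import re
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://wiki.example.org/robots.txt")
    rp.read()

    # Combined log: ip - - [time] "GET /path HTTP/1.1" status size "ref" "ua"
    line_re = re.compile(
        r'^(\S+) \S+ \S+ \[[^\]]+\] "GET (\S+)[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

    offenders = set()
    with open("/var/log/nginx/access.log") as log:
        for line in log:
            m = line_re.match(line)
            if m:
                ip, path, ua = m.groups()
                if not rp.can_fetch(ua, path):  # requested a disallowed path
                    offenders.add((ip, ua))

    for ip, ua in sorted(offenders):
        print(ip, ua)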

------
ufo
> Please do not remove the space at the start of this line, it breaks the
> rendering.

Does anyone know what this comment means? It is towards the end of the file.

~~~
bawolff
It's because the robots.txt file is created by combining a static file
([https://github.com/wikimedia/operations-mediawiki-config/blob/master/robots.txt](https://github.com/wikimedia/operations-mediawiki-config/blob/master/robots.txt))
with an on-wiki page, so that the different-language Wikipedias can manage
the file themselves:
[https://en.wikipedia.org/wiki/MediaWiki:Robots.txt](https://en.wikipedia.org/wiki/MediaWiki:Robots.txt)

In wikisyntax, starting a line with a space ensures that it's rendered in a
pre tag, so this was a way to make the on-wiki page display the whole thing
as preformatted text instead of with normal wiki formatting. That seems to
be broken now by the literal pre tag opened on the first line, but I'm
pretty sure this used to work as a way to display the entire file in a pre
tag.

For the curious, the code generating the robots.txt file is at
[https://github.com/wikimedia/operations-mediawiki-config/blob/master/w/robots.php](https://github.com/wikimedia/operations-mediawiki-config/blob/master/w/robots.php)
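
A rough Python sketch of that combining logic (the real code is PHP; this
only illustrates the idea, and the filename is illustrative). Per-wiki
additions live on the MediaWiki:Robots.txt page, which can be fetched raw:

    import urllib.request

    # Static base rules, maintained in the operations/mediawiki-config repo.
    static_part = open("robots-static.txt").read()

    # Per-wiki additions from the on-wiki page, fetched as raw wikitext.
    url = ("https://en.wikipedia.org/w/index.php"
           "?title=MediaWiki:Robots.txt&action=raw")
    with urllib.request.urlopen(url) as resp:
        wiki_part = resp.read().decode("utf-8")

    print(static_part + "\n" + wiki_part)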

