Wikipedia's robots.txt (wikipedia.org)
105 points by tosh on July 16, 2019 | 38 comments

This seems unnecessarily complicated and antagonistic. Mostly just to publicly shame a bunch of people.

At Reddit we originally blocked a couple of crawlers but then realized how pointless that was. The entire robots file[0] is basically now just for google. All of the restrictions are enforced on the server side because there were so many bad bots it didn’t matter if we listed them or not.

[0] https://www.reddit.com/robots.txt

GitHub has an interesting entry in theirs [0]. The only user that's in the Disallow section is "/ekansa".

He explained it a bit on a Twitter thread [1] where I brought it up a while back and it sounds like they didn't like the fact that he was using GitHub to host XML files for another service because of the traffic from crawlers it created.

[0] https://github.com/robots.txt

[1] https://twitter.com/ekansa/status/1137052076062650368

    User-agent: *
    Allow: /humans.txt
    Disallow: /


I was totally expecting /my_shiny_metal_ass to have some kind of easter egg.

We used to have one a long time ago when we fronted with nginx and I could just throw a static file up there, but once we switched to everything being dynamic, it wasn't worth the effort to code up a route. :)

> Mostly just to publicly shame a bunch of people.

For sure, but that's great, shame can be very effective.

https://www.reddit.com/robots.txt replies Not Found over here

This URL was returning "Not Found" for me as well, but with an HTTP 501 status instead of the traditional HTTP 404. After investigating, the issue seems to be cookie-related:

  $ curl -sIXGET 'https://www.reddit.com/robots.txt' -H 'cookie: rseor3=true;' | head -1
  HTTP/2 501
Compare with:

  $ curl -sIXGET 'https://www.reddit.com/robots.txt' | head -1
  HTTP/2 200

Thanks, works from incognito mode in browsers also then :)

    # Observed spamming large amounts of https://en.wikipedia.org/?curid=NNNNNN
    # and ignoring 429 ratelimit responses, claims to respect robots:
    # http://mj12bot.com/
    User-agent: MJ12bot
    Disallow: /
Coincidentally, I've just read more negative things about MJ12bot last week: http://boston.conman.org/2019/07/09.1
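
For contrast with a bot that ignores 429 responses, here is a minimal sketch of what honoring them could look like on the crawler side. The function name, the `default` fallback of 60 seconds, and the choice to read the standard `Retry-After` header are all assumptions for illustration; nothing here is taken from Wikipedia's or MJ12bot's code.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_delay_seconds(status, retry_after, now=None, default=60.0):
    """Return how long a polite crawler should pause after a response.

    status      -- HTTP status code of the response
    retry_after -- raw Retry-After header value, or None if absent
    default     -- hypothetical fallback when the header is missing or unparsable
    """
    if status != 429:
        return 0.0
    if retry_after is None:
        return default
    value = retry_after.strip()
    # Retry-After is either a number of seconds ...
    if value.isdigit():
        return float(value)
    # ... or an HTTP date.
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

A crawler would sleep for the returned number of seconds before sending its next request to the same host.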

You can read the rest of my MJ12Bot saga: http://boston.conman.org/2019/07/09-12 My take: they are grossly incompetent at programming.

Turns out that if one wants something from people, one shouldn't start the first interaction by calling them "grossly incompetent"...

The best robots.txt for Majestic:

    iptables -A INPUT -s <MJ12bot address> -j DROP

If only it was that easy. Last month MJ12Bot hit my site from 136 distinct IP addresses. If we drop the last octet, it's 120 unique class-C addresses, and if we drop the last two octets, then 43 unique class-B addresses (and why not---31 distinct class-A addresses). It's a distributed bot. Very hard to block, so I think I came out ahead by them no longer spidering my site.

Edit: Added count of class-A blocks.
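
The counting described above can be sketched with Python's `ipaddress` module. This is only an illustration of the arithmetic: the sample addresses are made-up documentation ranges, not the real MJ12bot addresses, and "class C/B/A" is read here as /24, /16 and /8 prefixes.

```python
import ipaddress


def distinct_prefix_counts(addresses):
    """Count distinct IPv4 addresses and the distinct /24, /16 and /8 networks they fall in."""
    ips = {ipaddress.ip_address(a) for a in addresses}
    counts = {"addresses": len(ips)}
    for prefix in (24, 16, 8):
        # strict=False lets us pass a host address and get its enclosing network.
        networks = {ipaddress.ip_network(f"{ip}/{prefix}", strict=False) for ip in ips}
        counts[f"/{prefix}"] = len(networks)
    return counts


# Hypothetical sample; in practice the list would come from the server logs.
sample = ["192.0.2.10", "192.0.2.77", "192.0.3.5", "198.51.100.1", "203.0.113.9"]
print(distinct_prefix_counts(sample))
# {'addresses': 5, '/24': 4, '/16': 3, '/8': 3}
```

When the /8 count stays high relative to the address count, as in the numbers quoted above, blocking by range does not buy much.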

There are dozens of such bots: ones that promise to honor robots.txt but spam your server with nonsensical requests, ask for pages that haven't existed in a decade, and happily ignore rate limits.

To be honest, robots.txt is not for these kinds of bots. These kinds of bots are either malicious or incompetent. But more importantly, they're 100% useless to you as a website operator. They offer no SEO benefit, drive no significant traffic and simply consume resources.

The answer, sadly, is to hit them at the web server / load balancer / reverse proxy layer and just bruteforce all these bad actors away.

They'll never stop trying, though. Checking some NGINX logs for some of these bots that have been blocked for years, they still knock on the door over and over again.
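
A rough sketch of that kind of blocking, written as a WSGI wrapper so it is self-contained. This is an assumption-heavy illustration: the deny-list contents and function names are hypothetical, and in practice the same check would more likely live in the web server or load balancer config than in application code.

```python
# Hypothetical deny-list; MJ12bot is the only name taken from this thread.
BLOCKED_AGENT_TOKENS = ("mj12bot",)


def is_blocked(user_agent, tokens=BLOCKED_AGENT_TOKENS):
    """Return True if the User-Agent header contains a deny-listed token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in tokens)


def deny_bad_bots(app):
    """Wrap a WSGI app so deny-listed user agents get a 403 before the app runs."""
    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware
```

User-Agent strings are trivially spoofed, so this only stops bots that identify themselves honestly; anything else needs IP-level or behavior-based blocking.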

Bulk filtering like that only cements FAANG hegemony. I agree it's a problem, but I don't agree this is the solution.

A whitelist of FAANG crawlers would cement their hegemony - a blacklist of known-badly-behaved crawlers doesn't.

I don't filter out anyone but bad actors. If you abide by robots.txt you're free to scrape my sites

Wikipedia's rationale to exclude articles for deletion pages is interesting:

    # Folks get annoyed when XfD discussions end up the number 1 google hit for their name.

There are a lot of webpages with my name on them, and the Cuil search engine put the XfD discussion about deleting my Wikipedia page (because I'm not significant enough) on the first page of Cuil's results for my name. I was thrilled :-D

> User-agent: TeleportPro

I remember using it to download whole (small) websites on dialup and then to read offline.

In Wikipedia's case, they'd prefer you download a dump or use an app specifically designed to read Wikipedia offline. Rendering all that PHP is (or at least at one point was) expensive.

reference for above comment:


I, somewhat fancifully, keep two flashdrives with a wikipedia data dump on them "just in case"

I didn't know they still sell TeleportPro!

I haven't put up a robots.txt in years; they are utterly pointless, IMHO. Even Google doesn't honor them (nor have any of the others, like Alexa, for many years now). Read your web server logs and you'll see Googlebot crawling pages it would not touch if it were honoring your robots.txt.

Have you checked the IPs to make sure they're actually Google? There's a strong incentive to fake being Google.
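
One common way to check is a reverse DNS lookup followed by a forward lookup that must point back at the same IP. Below is a minimal sketch of that idea; the suffix list is an assumption and should be checked against Google's current documentation, and `gethostbyname_ex` only covers IPv4. The lookup functions are parameters so the logic can be exercised without network access.

```python
import socket

# Assumed hostname suffixes for Google's crawlers; verify against Google's own docs.
GOOGLE_CRAWLER_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip,
                          reverse_lookup=socket.gethostbyaddr,
                          forward_lookup=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the domain, then confirm with a forward lookup."""
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False
    if not hostname.lower().rstrip(".").endswith(GOOGLE_CRAWLER_SUFFIXES):
        return False
    try:
        forward_ips = forward_lookup(hostname)[2]
    except OSError:
        return False
    # The forward lookup has to include the original address, otherwise
    # anyone controlling their own reverse DNS could fake the hostname.
    return ip in forward_ips
```

Calling `is_verified_googlebot(ip)` with an address from the logs does the real DNS queries; a request that claims to be Googlebot but fails this check is an impostor.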

Yes, the IPs were owned by them; they even indexed pages they were asked not to. :x

I've always thought something like robots.txt was a bit silly when it's so easy to ignore.

robots.txt is supposed to be helpful to the robot too.

If you write a crawler, you probably don't want it to waste time indexing a list of articles in every possible sort order, trying all "reply" buttons, things like that.

For me, a "Disallow" line in robots.txt means "don't bother, nothing interesting here". It is a suggestion that benefits everyone when followed, not an access control list.
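
For a crawler author, following those suggestions takes only a few lines with Python's standard `urllib.robotparser`. In this sketch the MJ12bot stanza is the one quoted from Wikipedia's file earlier in the thread; the `*` stanza, the `/search` path and the `ExampleCrawler` name are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: MJ12bot
Disallow: /

User-agent: *
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(agent, url):
    """Ask the parsed rules whether this agent may fetch this URL."""
    return parser.can_fetch(agent, url)


print(allowed("MJ12bot", "https://en.wikipedia.org/wiki/Main_Page"))         # False
print(allowed("ExampleCrawler", "https://en.wikipedia.org/wiki/Main_Page"))  # True
print(allowed("ExampleCrawler", "https://en.wikipedia.org/search"))          # False
```

Note that `can_fetch` expects the crawler's short product token (such as `MJ12bot`), not a full browser-style User-Agent string.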

>If you write a crawler, you probably don't want it to waste time indexing a list of articles in every possible sort order, trying all "reply" buttons, things like that.

On the other hand, many websites (like wikipedia here) hide interesting pages behind a Disallow.

I think the concern is more accurately: you must go out of your way to honor robots.txt.

I think both robots.txt and security.txt are great ideas. However, they will only ever be useful with those who follow the wishes of the website (who hopefully outnumber those who do not).

Google, Bing, and other large search engines honor it. That alone is the point of most of the entries in this file.

robots.txt is the sheet with the house rules on the wall (or part thereof), not the enforcement of those rules.

Part of my monthly maintenance on an independent Mediawiki install is to cross-reference our robots.txt (which is based on Wikipedia's) and server logs.

If a client or IP range is misbehaving in the server logs, it goes into robots.txt. If it's ignoring robots.txt, it gets added to the firewall's deny list.

I've tried to automate that process a few times but have never gotten far. It's unending, though. It feels like all it takes is a handful of cash and a few days to start an SEO marketing buzzword company with its own crawler, all to build yet another thing for us to block.
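
A possible starting point for automating the second step (finding clients that ignore robots.txt) is sketched below. It assumes logs in the common "combined" format and a hand-maintained list of bot tokens to look for; the function name, the regex and the sample inputs are all hypothetical, not the commenter's actual setup.

```python
import re
from collections import Counter
from urllib.robotparser import RobotFileParser

# Assumed "combined" access-log layout: ip, two ids, [time], "request", status, size, "referer", "agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def robots_violations(log_lines, robots_txt, bot_tokens):
    """Count requests per (ip, bot token) that robots.txt disallows for that bot."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    hits = Counter()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed format
        agent = match["agent"].lower()
        for token in bot_tokens:
            # can_fetch wants the short product token, not the full UA string.
            if token.lower() in agent and not parser.can_fetch(token, match["path"]):
                hits[(match["ip"], token)] += 1
    return hits
```

The resulting `(ip, token)` pairs with the highest counts are the candidates for the firewall's deny list; a human would still want to review them before blocking.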

It's like putting up this sign in a public restroom (slightly nsfw):


It's a bit like "do not track". The people you want to use it for don't care and won't respect it (even if publicly they might say they will).

> Please do not remove the space at the start of this line, it breaks the rendering.

Does anyone know what this comment means? It is towards the end of the file.

It's because the robots.txt file is created by combining a static file ( https://github.com/wikimedia/operations-mediawiki-config/blo... ) with an on-wiki page, so that different language Wikipedias can manage the file themselves - https://en.wikipedia.org/wiki/MediaWiki:Robots.txt

In wikisyntax, starting a line with a space ensures that it's in a pre tag, so this was a way to make the on-wiki page show the entire thing in a pre tag and not use normal formatting. This seems to be broken by actually starting a literal pre tag inside the first line, but I'm pretty sure this used to work as a way to display the entire thing in a pre tag.

For the curious, the code generating the robots.txt file is at https://github.com/wikimedia/operations-mediawiki-config/blo...
