This seems unnecessarily complicated and antagonistic. Mostly just to publicly shame a bunch of people.
At Reddit we originally blocked a couple of crawlers but then realized how pointless that was. The entire robots file [0] is basically now just for Google. All of the restrictions are enforced on the server side because there were so many bad bots it didn't matter if we listed them or not.

[0] https://www.reddit.com/robots.txt
GitHub has an interesting entry in theirs [0]. The only user that's in the Disallow section is "/ekansa".
He explained it a bit in a Twitter thread [1] where I brought it up a while back; it sounds like they didn't like that he was using GitHub to host XML files for another service, because of the crawler traffic it created.
We used to have one a long time ago when we fronted with nginx and I could just throw a static file up there, but once we switched to everything being dynamic, it wasn't worth the effort to code up a route. :)
This URL was returning "Not Found" for me as well, but with an HTTP 501 status instead of the traditional HTTP 404. After investigating, the issue seems to be cookie-related:
If only it were that easy. Last month MJ12Bot hit my site from 136 distinct IP addresses. If we drop the last octet, that's 120 unique class-C addresses; drop the last two octets and it's 43 unique class-B addresses (and, why not, 31 distinct class-A addresses). It's a distributed bot and very hard to block, so I think I came out ahead by them no longer spidering my site.
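A rough sketch of how you could pull those counts out of an access log (the log path and user-agent substring are assumptions, not a definitive recipe):

    ips = set()
    for line in open("/var/log/nginx/access.log"):    # assumed log location
        if "mj12bot" in line.lower():                  # matches the MJ12Bot user agent string
            ips.add(line.split()[0])                   # client IP is the first field
    print(len(ips), "distinct IPs")
    for n in (3, 2, 1):                                # /24, /16, /8 prefixes
        prefixes = {".".join(ip.split(".")[:n]) for ip in ips}
        print(len(prefixes), "unique /%d prefixes" % (8 * n))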
There are dozens of such bots: ones that promise they honor robots.txt but spam your server with nonsensical requests, ask for pages that haven't existed in a decade, and happily ignore rate limits.
To be honest, robots.txt is not for these kinds of bots. These kinds of bots are either malicious or incompetent. But more importantly, they're 100% useless to you as a website operator. They offer no SEO benefit, drive no significant traffic and simply consume resources.
The answer, sadly, is to hit them at the web server / load balancer / reverse proxy layer and just bruteforce all these bad actors away.
They'll never stop trying, though. Looking through NGINX logs, some of these bots that have been blocked for years still knock on the door over and over again.
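For reference, the blocking itself is only a few lines at the nginx layer. A sketch, with the bot names as examples only (444 is nginx's "close the connection without responding" code):

    # In the http {} block: flag requests by user agent.
    map $http_user_agent $bad_bot {
        default       0;
        ~*mj12bot     1;
        ~*semrushbot  1;
    }

    server {
        listen 80;
        server_name example.com;
        root /var/www/html;

        # Drop flagged bots before they reach anything expensive.
        if ($bad_bot) {
            return 444;
        }
    }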
There are a lot of webpages with my name on them, and the Cuil search engine put the XfD discussion about deleting my Wikipedia page (because I'm not significant enough) on the first page of Cuil's results for my name. I was thrilled :-D
In Wikipedia's case, they'd prefer you download a dump or use an app specifically designed to read Wikipedia offline. Rendering all that PHP is (or at least at one point was) expensive.
I haven't put up a robots.txt in years; they are utterly pointless, IMHO. Even Google doesn't honor them (nor have any of the others, like Alexa, for many years now). Read your web server logs: you'll see Googlebot crawling pages it shouldn't be touching if it were honoring your robots.txt.
robots.txt is supposed to be helpful to the robot too.
If you write a crawler, you probably don't want it to waste time indexing a list of articles in every possible sort order, trying all "reply" buttons, things like that.
For me, a "Disallow" line in robots.txt means "don't bother, nothing interesting here". It is a suggestion that benefits everyone when followed, not an access control list.
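For example, a Disallow section in that spirit might look like this (paths invented for illustration; the * wildcard is a widely supported extension rather than part of the original spec):

    User-agent: *
    # Not secrets, just noise a crawler gains nothing from
    Disallow: /search
    Disallow: /*?sort=
    Disallow: /comment/reply/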
>If you write a crawler, you probably don't want it to waste time indexing a list of articles in every possible sort order, trying all "reply" buttons, things like that.
On the other hand, many websites (like wikipedia here) hide interesting pages behind a Disallow.
I think the concern is more accurately: you must go out of your way to honor robots.txt.
I think both robots.txt and security.txt are great ideas. However, they will only ever be useful with those who follow the wishes of the website (who hopefully outnumber those who do not).
Part of my monthly maintenance on an independent Mediawiki install is to cross-reference our robots.txt (which is based on Wikipedia's) and server logs.
If a client or IP range is misbehaving in the server logs, it goes into robots.txt. If it's ignoring robots.txt, it gets added to the firewall's deny list.
I've tried to automate that process a few times but haven't ever gotten far. It's unending, though. Feels like all it takes is a handful of cash and a few days to start an SEO marketing buzzword company with its own crawler, all to build yet another thing for us to block.
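A sketch of what the automated cross-reference might look like, assuming a combined-format access log and a robots.txt with plain path prefixes (the paths and locations here are made up):

    import re
    from collections import Counter

    LOG_PATH = "/var/log/nginx/access.log"     # assumed location
    ROBOTS_PATH = "/var/www/wiki/robots.txt"   # assumed location

    # Combined log format: we only need the client IP, request path and user agent.
    LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+)[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

    def disallowed_prefixes(path):
        prefixes = []
        for line in open(path):
            line = line.split("#", 1)[0].strip()
            if line.lower().startswith("disallow:"):
                value = line.split(":", 1)[1].strip()
                if value:
                    prefixes.append(value)
        return prefixes

    prefixes = disallowed_prefixes(ROBOTS_PATH)
    offenders = Counter()
    for line in open(LOG_PATH, errors="replace"):
        m = LOG_RE.match(line)
        if not m:
            continue
        ip, url, agent = m.groups()
        if any(url.startswith(p) for p in prefixes):
            offenders[(agent, ip)] += 1

    # Anything near the top of this list is ignoring robots.txt and is a
    # candidate for the firewall's deny list.
    for (agent, ip), hits in offenders.most_common(20):
        print("%6d  %-15s  %s" % (hits, ip, agent))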
In wikisyntax, starting a line with a space ensures that it's rendered in a pre tag, so this was a way to make the on-wiki page show the entire thing preformatted rather than with normal wiki formatting. That seems to be broken now by the literal pre tag at the start of the first line, but I'm pretty sure this used to work as a way to display the whole file in a pre block.
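i.e. the on-wiki source would just be the file's contents with a single leading space on every line, something like:

    User-agent: *
    Disallow: /w/

and MediaWiki renders the whole block preformatted instead of trying to parse it as wiki markup.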