
Common Robots.txt Pitfalls and How to Avoid Them - AnnYaroshenko
https://jetoctopus.com/tech-problem/a50-robots-txt-pitfalls-and-how-to-avoid.html
======
ktpsns
I have never seen a good use case for allowing one search engine to index a
page while disallowing another one, neither today nor 20 years ago. For the
Google- and AdSense-related stuff, using robots.txt to control their
behaviour seems to me like a bad API, given that Google has offered fancy
webmaster consoles for years to change how a site appears in its results.

Really, the only use case for distinguishing robots I ever stumbled upon was,
as a client, a site blacklisting wget (looking at you, ArXiv.org). But then we
just switched to curl or let wget ignore robots.txt, while still being nice to
the server/network load.
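
For the record, the polite override looks something like this (all three
flags are standard GNU Wget options; the URL is just a placeholder):

    # -e robots=off disables robots.txt handling; --wait and --limit-rate
    # keep the server/network load reasonable
    wget --recursive -e robots=off --wait=2 --limit-rate=200k https://example.org/papers/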

------
AnnYaroshenko
1. Ignoring disallow directives for a specific user-agent block
2. One robots.txt file for different subdomains
3. Listing secure directories
4. Blocking relevant pages
5. Forgetting to add directives for specific bots where needed
6. Adding a relative path to the sitemap
7. Ignoring the slash in a Disallow field
8. Forgetting about case sensitivity
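
To make a few of these concrete, a minimal sketch (the domain and paths are
made up; the grouping and case-sensitivity behaviour follow the robots.txt
standard):

    # Rules for all crawlers
    User-agent: *
    # The trailing slash blocks only the directory; "Disallow: /private"
    # (no slash) would also block /private.html, /private-old/, ... (pitfall 7)
    Disallow: /private/

    # A more specific user-agent group REPLACES the * group for that bot,
    # so repeat every directive that should still apply (pitfalls 1 and 5)
    User-agent: Googlebot
    Disallow: /private/
    Disallow: /drafts/

    # Paths are case-sensitive: /Private/ is NOT covered above (pitfall 8).
    # Each subdomain also needs its own robots.txt at its root (pitfall 2),
    # and listing secret paths here only advertises them (pitfall 3).

    # The sitemap reference must be an absolute URL, not a relative path (pitfall 6)
    Sitemap: https://example.com/sitemap.xml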

