
    # Observed spamming large amounts of https://en.wikipedia.org/?curid=NNNNNN
    # and ignoring 429 ratelimit responses, claims to respect robots:
    # http://mj12bot.com/
    User-agent: MJ12bot
    Disallow: /

Coincidentally, I read more negative things about MJ12bot just last week: http://boston.conman.org/2019/07/09.1
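
For reference, here is a minimal sketch of what a crawler that really respects robots.txt would conclude from the two-line rule quoted above. It uses Python's standard `urllib.robotparser`; the `allowed` helper, the `SomeOtherBot` name, and the `curid` value are made up for illustration and are not taken from Wikipedia's actual file or from MJ12bot's code.

```python
from urllib.robotparser import RobotFileParser

# The two directives quoted above, without the comment lines.
rules = """\
User-agent: MJ12bot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

def allowed(agent, url):
    """Hypothetical helper: may `agent` fetch `url` under these rules?"""
    return parser.can_fetch(agent, url)

# The curid value is an arbitrary placeholder.
print(allowed("MJ12bot", "https://en.wikipedia.org/?curid=12345"))
print(allowed("SomeOtherBot", "https://en.wikipedia.org/?curid=12345"))
```

With only this entry loaded, every path is off limits to MJ12bot, while agents that match no entry are allowed by default.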

You can read the rest of my MJ12Bot saga: http://boston.conman.org/2019/07/09-12 My take: they are grossly incompetent at programming.

Turns out that if one wants something from people, one shouldn't start the first interaction by calling them "grossly incompetent"...

The best robots.txt for Majestic:

    iptables -A INPUT -s -j DROP

If only it were that easy. Last month MJ12Bot hit my site from 136 distinct IP addresses. If we drop the last octet, that's 120 unique class-C networks, and if we drop the last two octets, 43 unique class-B networks (and, why not, 31 distinct class-A networks). It's a distributed bot. Very hard to block, so I think I came out ahead by them no longer spidering my site.

Edit: Added count of class-A blocks.
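
The octet-dropping tally above can be sketched with Python's standard `ipaddress` module. This is only an illustration of the counting: the `count_prefixes` helper is hypothetical, and the sample addresses come from the reserved documentation ranges, not from the commenter's logs.

```python
import ipaddress

def count_prefixes(addresses, prefix_len):
    """Count distinct IPv4 networks of the given prefix length."""
    networks = {
        # strict=False masks off the host bits, e.g. 192.0.2.10/24 -> 192.0.2.0/24
        ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        for addr in addresses
    }
    return len(networks)

# Hypothetical sample (documentation address ranges), not real log data.
sample = [
    "192.0.2.10",
    "192.0.2.77",
    "198.51.100.5",
    "203.0.113.9",
    "203.0.113.200",
]

# /32 = full address, /24 = drop last octet, /16 = drop two, /8 = drop three.
for bits in (32, 24, 16, 8):
    print(f"/{bits}: {count_prefixes(sample, bits)} distinct")
```

Fed the source addresses from a real access log, the same loop would show how little the set shrinks as octets are dropped, which is what makes an address-based block impractical here.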

There are dozens of such bots: they promise to honor robots.txt, but they spam your server with nonsensical requests, ask for pages that haven't existed in a decade, and happily ignore rate limits.

To be honest, robots.txt is not for these kinds of bots. These kinds of bots are either malicious or incompetent. But more importantly, they're 100% useless to you as a website operator. They offer no SEO benefit, drive no significant traffic and simply consume resources.

The answer, sadly, is to hit them at the web server / load balancer / reverse proxy layer and just bruteforce all these bad actors away.

They'll never stop trying, though. Checking some NGINX logs for some of these bots that have been blocked for years, they still knock on the door over and over again.
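
As a rough illustration of filtering by User-Agent before a request reaches the site, here is a small WSGI middleware in Python. It is a stand-in at the application layer, not the NGINX or load-balancer configuration mentioned above; the `BLOCKED_AGENTS` list, `block_bad_bots`, and `demo_app` are all hypothetical names.

```python
# Assumed blocklist: lowercase substrings matched against the User-Agent header.
BLOCKED_AGENTS = ("mj12bot",)

def block_bad_bots(app, blocked=BLOCKED_AGENTS):
    """Wrap a WSGI app and answer 403 for blocklisted User-Agent strings."""
    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(name in agent for name in blocked):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everyone else is passed through to the wrapped application.
        return app(environ, start_response)
    return middleware

def demo_app(environ, start_response):
    """Placeholder application standing in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]

application = block_bad_bots(demo_app)
```

This only works against bots that send an honest User-Agent, and doing the same check at the proxy keeps the rejected requests off the application entirely.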

Bulk filtering like that only cements FAANG hegemony. I agree it's a problem, but I do not agree this is the solution.

A whitelist of FAANG crawlers would cement their hegemony - a blacklist of known-badly-behaved crawlers doesn't.

I don't filter out anyone but bad actors. If you abide by robots.txt, you're free to scrape my sites.
