
We also need a robots.txt extension for excluding publicly accessible files from AI training datasets. IIRC there's a nascent ai.txt, but I'm not sure anyone follows it (yet).
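
For what it's worth, the closest thing that exists today is just listing the AI crawlers' published user-agent tokens in plain robots.txt and hoping they honor them, e.g.:

    User-agent: GPTBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

(GPTBot is OpenAI's crawler token and Google-Extended is Google's AI-training opt-out token; whether a given bot actually respects them is exactly the open question.)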



I don't think `robots.txt` works because crawlers want to be nice, or "socially responsible", or anything like that. So I don't hold out much hope that anything similar can happen again.

Early search engines had a problem: when they crawled willy-nilly, people would block their IP addresses. The `robots.txt` convention worked because search engines wanted something, namely to avoid IP blocks they couldn't easily get around, and site hosts generally wanted to be indexed.

Today it's WAY harder to block the relevant IP addresses, so site hosts generally can't keep out a crawler that wants their data: there is no compromise to be found, and the imbalance of power is much greater. And many site hosts don't want to be crawled for free for AI purposes at all. Pretty much anyone who sets up an `ai.txt` uses it to reject all crawling, so there is no reason for any crawler to respect it.


Google ignores robots.txt, as do many others. Try it yourself: set up a honeypot URL, don't even link to it, just throw it in robots.txt, and Googlebot will visit it at some point.
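
If you want to try it, a minimal version is something like this (the path is made up and never linked from anywhere):

    User-agent: *
    Disallow: /robots-honeypot-2f9c/

Then grep your access logs for that path; any hit means the crawler read robots.txt and fetched the "disallowed" URL anyway.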


I discovered this years ago, and it's what made me stop bothering with robots.txt and start blocking all the crawlers I can using .htaccess, including Google's.
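
For anyone wanting to do the same, a rough Apache 2.4 sketch (the user-agent list is only an example, and it obviously only catches bots that identify themselves honestly):

    # Tag requests from known crawler user agents (mod_setenvif)
    BrowserMatchNoCase "Googlebot|GPTBot|CCBot|Bytespider" bad_bot
    # Deny tagged requests, allow everything else (mod_authz_core)
    <RequireAll>
        Require all granted
        Require not env bad_bot
    </RequireAll>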

That's a game of whack-a-mole that always lets a few miscreants through. I used to find that an acceptable amount of error until I learned that crawlers were gathering data to be used to train LLMs. That's a situation where even a single bot getting through is very problematic.

I still haven't found a solution to that aside from no longer allowing access to my sites without an account.


I think the closest thing is the NoAI and NoImageAI meta tags, which have some relatively prominent adoption.
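
If I remember right, the usual form is a robots meta tag (or the equivalent X-Robots-Tag response header), something like:

    <meta name="robots" content="noai, noimageai">

It's still purely advisory, of course; it only helps against scrapers that choose to look for it.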


robots.txt is useless as a defense mechanism (that isn't what it's trying to be). Taking the same approach for AI would likewise not be useful as a defense mechanism.


Haven't some companies explicitly ignored robots.txt to scrape sites more quickly (and pissed off a number of people in the process)?




